Replace Buffer with Uint8Array #452

Merged
merged 36 commits into from
Sep 21, 2024

Conversation

valadaptive
Contributor

I still need to do some cleanups and replace uses of Buffer.compare, but I'm putting this PR here so you can benchmark it. Before I can complete this, #428 (minus the version bump) should be merged in, as the services code uses Buffer everywhere and expects fixed/bytes types to decode to Buffers.
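As an illustration of what replacing Buffer.compare could involve, here is a minimal sketch of a lexicographic byte comparison over Uint8Arrays. The name compareBytes is hypothetical and this is not necessarily the helper the PR ended up with; it only mirrors Buffer.compare's -1 / 0 / 1 return convention.

```javascript
// Hypothetical helper (not taken from the PR): compares two Uint8Arrays
// byte by byte and returns -1, 0, or 1, like Buffer.compare does.
function compareBytes(a, b) {
  const len = Math.min(a.length, b.length);
  for (let i = 0; i < len; i++) {
    if (a[i] !== b[i]) {
      return a[i] < b[i] ? -1 : 1;
    }
  }
  // All shared bytes are equal, so the shorter array sorts first.
  if (a.length === b.length) return 0;
  return a.length < b.length ? -1 : 1;
}
```

Because it only relies on indexed access and length, a helper like this works on Buffers as well, since Buffer is a Uint8Array subclass.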

I've made some tweaks to the benchmarking setup, so comparing this directly to the master branch won't work:

  • I've added an ArrayFloat benchmark to go along with ArrayDouble.
  • I use the --expose-gc flag when benchmarking in order to manually trigger garbage collection between benches, hopefully making results more consistent.
  • I've changed the length distribution of strings to be exponentially weighted, so shorter strings are still likely but longer strings will now be occasionally generated. The previous code was only benchmarking the manual path of Tap#writeString, since it only generated strings up to a length of 32.
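For readers unfamiliar with the --expose-gc approach mentioned in the list above, a rough sketch of a benchmark loop that collects garbage between benches might look like this. The function name runBenchmarks and the shape of the bench objects are assumptions made for illustration; the repository's actual perf script may be organized differently.

```javascript
// Sketch only: run with `node --expose-gc <script>` so that gc() is defined.
// runBenchmarks and the { name, fn, iterations } shape are hypothetical.
function runBenchmarks(benches) {
  const results = [];
  for (const { name, fn, iterations } of benches) {
    // gc() is only exposed when node is started with --expose-gc;
    // without the flag this step is skipped.
    if (typeof globalThis.gc === 'function') {
      globalThis.gc();
    }
    const start = process.hrtime.bigint();
    for (let i = 0; i < iterations; i++) {
      fn();
    }
    const elapsedNs = process.hrtime.bigint() - start;
    results.push({ name, nsPerOp: Number(elapsedNs) / iterations });
  }
  return results;
}
```

Collecting before each bench means garbage left over from one bench is less likely to be charged to the next one's timing.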

I've cherry-picked those benchmarking changes into the bench-tweaks branch, which you can use to compare benchmarks.

@mtth
Owner

mtth commented Mar 4, 2024

Thanks @valadaptive! I'll try to find time to merge #428.

@mtth
Owner

mtth commented Mar 30, 2024

FYI @valadaptive - #428 is in.

@valadaptive valadaptive force-pushed the debufferify branch 3 times, most recently from 6f39bde to bd70cce March 30, 2024 21:17
@valadaptive
Contributor Author

Working on removing Buffer usage from types.js now. I noticed the isJsonBuffer function, which seems to check if a given object is the JSON representation of a Buffer. Under what circumstances are Avro types directly serialized to JSON and/or parsed back directly? I can't easily polyfill Uint8Array to stringify to a regular array, so I'll probably need to insert a fixup step when stringifying/parsing.
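For context, a check for "the JSON representation of a Buffer" could be sketched as below. This is an assumption about the general shape of such a check, based on the object that Buffer#toJSON produces; the library's actual isJsonBuffer may differ, and the name looksLikeJsonBuffer is hypothetical.

```javascript
// Hypothetical sketch: detects the { type: 'Buffer', data: [...] } shape
// produced by Buffer#toJSON. Not the library's real isJsonBuffer.
function looksLikeJsonBuffer(obj) {
  return (
    obj !== null &&
    typeof obj === 'object' &&
    obj.type === 'Buffer' &&
    Array.isArray(obj.data)
  );
}
```

A parsed Uint8Array has neither a type nor a data field, so a check of this kind would not recognize it.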

@valadaptive
Contributor Author

@mtth Do you intend to support the ability to roundtrip various Avro types to/from JSON? With the current representation that uses Buffers, this works mostly fine, but with Uint8Array, the JSON representation is a lot more bloated:

> JSON.stringify(Buffer.from([1, 2, 3, 4, 5]))
'{"type":"Buffer","data":[1,2,3,4,5]}'
> JSON.stringify(new Uint8Array([1, 2, 3, 4, 5]))
'{"0":1,"1":2,"2":3,"3":4,"4":5}'

I lean towards removing the coerceBuffers option entirely to discourage people from serializing Uint8Arrays to JSON.

@mtth
Owner

mtth commented Sep 13, 2024

Do you intend to support the ability to roundtrip various Avro types to/from JSON?

It's nice to have, but I'm OK dropping coerceBuffers if it adds significant complexity.

Owner

@mtth mtth left a comment

Apologies for the slow review. Overall this looks great but I would like more time to cover all the changes. I'm sending a first batch of comments now, mostly questions, but feel free to wait until I send the next batch (hopefully this weekend) to respond.

@valadaptive
Contributor Author

It's nice to have, but I'm OK dropping coerceBuffers if it adds significant complexity.

I believe it works right now in terms of recognizing Buffers. However, since we now use Uint8Arrays instead of Buffers in types, they'll serialize to much larger objects in JSON (e.g. new Uint8Array([1, 2, 3, 4, 5]) is serialized as '{"0":1,"1":2,"2":3,"3":4,"4":5}').

I can either try to implement code to recognize JSON'd Uint8Arrays, or remove coerceBuffers entirely. Leaving it as-is would be a huge footgun, because you can no longer round-trip Avro types through JSON (it'll serialize Uint8Arrays to a JSON representation it cannot itself recognize as a Uint8Array).
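To make the first option concrete, one possible shape for a stringify/parse fixup step is a JSON replacer/reviver pair. This is only a sketch: the "$bytes" tag and both function names are invented for illustration and are not part of avsc.

```javascript
// Hypothetical fixup step. The "$bytes" tag is an assumption made up for
// this sketch; avsc does not define it.
function bytesReplacer(key, value) {
  // Uint8Array has no toJSON method, so the replacer sees the array itself.
  if (value instanceof Uint8Array) {
    return { $bytes: Array.from(value) };
  }
  return value;
}

function bytesReviver(key, value) {
  if (
    value !== null &&
    typeof value === 'object' &&
    Array.isArray(value.$bytes)
  ) {
    return Uint8Array.from(value.$bytes);
  }
  return value;
}

const original = { id: 1, payload: new Uint8Array([1, 2, 3, 4, 5]) };
const json = JSON.stringify(original, bytesReplacer);
const restored = JSON.parse(json, bytesReviver);
```

The tagged form keeps the compact array layout and makes the round trip unambiguous, at the cost of reserving a key name that ordinary records must not use.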

@joscha
Contributor

joscha commented Sep 13, 2024

Short reference to https://gist.github.com/joscha/d8603a1f0af5b0b055546c792b2a8ff6, which can be updated once this pull request lands.

Owner

@mtth mtth left a comment

Just a few minor comments, and I expect this will be good to go.

lib/types.js Outdated
Comment on lines 1080 to 1094
return RANDOM.nextString(
Math.floor(-Math.log(RANDOM.nextFloat()) * 16) + 1
);
Owner

Why this change?

Contributor Author

I touched on this in the PR description above, but this is solely for benchmarking purposes--when rewriting and optimizing the string encoding/decoding functions, I wanted to make sure I exercised both the "short string" and "long string" code paths. This is an exponential distribution, which means that shorter strings are more likely but longer ones are possible.

I can revert this, although in the long term it might be better for Type#random to be moved into the testing code, since I'm not sure what purpose it serves to users of the library.

Owner

Thanks for the background. This is fine. Agreed that this method would be best moved out of the core types, but this is best done separately.
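The exponentially weighted length discussed in this thread can be sketched with inverse-transform sampling. The function below is an illustrative stand-in: randomLength and its parameters are hypothetical, and it uses Math.random by default rather than the library's RANDOM helper. Unlike the snippet under review, it samples 1 - u so that the logarithm never receives 0.

```javascript
// Sketch of an exponentially distributed length, assuming a uniform
// source on [0, 1). randomLength is a hypothetical name.
function randomLength(mean, nextFloat = Math.random) {
  // 1 - u lies in (0, 1], so Math.log never sees 0.
  const u = 1 - nextFloat();
  // -ln(u) * mean is exponentially distributed with the given mean;
  // flooring and adding 1 yields an integer length of at least 1.
  return Math.floor(-Math.log(u) * mean) + 1;
}
```

With a mean of 16, most lengths stay short, but roughly one draw in seven exceeds 32 (the probability is about e^-2), which is what lets a benchmark reach the long-string code path.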

lib/utils.js Outdated
Comment on lines +839 to +852
// The maximum number that a signed varint can store in a single byte is 63.
// The maximum size of a UTF-8 representation of a UTF-16 string is 3 times
// its length, as one UTF-16 character can be represented by up to 3 bytes
// in UTF-8. Therefore, if the string is 21 characters or less, we know that
// its length can be stored in a single byte, which is why we choose 21 as
// the small-string threshold specifically.
Owner

Thank you for the great comments, here and throughout.
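The arithmetic in the code comment under review can be checked with a short sketch. It assumes Avro's zigzag varint encoding for lengths; the helper names are hypothetical, and the zigzag function here is limited to 32-bit integers for simplicity.

```javascript
// Zigzag-encode a signed 32-bit integer (sketch; 32-bit range only).
function zigzag(n) {
  return ((n << 1) ^ (n >> 31)) >>> 0;
}

// Number of bytes needed to store n as a zigzag varint, at 7 payload
// bits per byte.
function varintSize(n) {
  let z = zigzag(n);
  let size = 1;
  while (z >= 0x80) {
    z >>>= 7;
    size++;
  }
  return size;
}

// 21 is the threshold named in the comment; 3 is the worst-case number
// of UTF-8 bytes per UTF-16 code unit.
const SMALL_STRING_THRESHOLD = 21;
const worstCaseBytes = SMALL_STRING_THRESHOLD * 3;

// '€' (U+20AC) takes 3 bytes in UTF-8, so 21 of them hit the worst case.
const euroBytes = new TextEncoder().encode('€'.repeat(21)).length;
```

Here worstCaseBytes and euroBytes are both 63, and varintSize(63) is 1 while varintSize(64) is 2, which matches the reasoning for choosing 21.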

return new Tap(buf);
}

toBuffer () {
Owner

@mtth mtth Sep 16, 2024

(Out of scope for this PR.) It's surprising to have "buffer" functions return typed arrays now. It will be good to update the methods throughout the package to have consistent names later on.

Contributor Author

Definitely something that should be done before a breaking 6.0 release. I contemplated doing it as part of this PR, but I'm not sure what terminology would be best. toBinary sounds like it could refer to a binary string. Maybe toTypedArray or toBytes? Maybe rename Type#encode to Type#encodeInto a la TextEncoder and repurpose the old method name?

Owner

An "Into" suffix for encode sounds good; I like the consistency with TextEncoder. binaryEncode (and jsonEncode) might be fine, they mirror the wording in the Avro specification.
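For reference, this is the TextEncoder naming convention the thread refers to: encode allocates and returns a new Uint8Array, while encodeInto writes into a caller-supplied one. The snippet shows only the standard TextEncoder API; Type#encodeInto is a name proposed in this discussion, not something shown here.

```javascript
const encoder = new TextEncoder();

// encode() allocates and returns a fresh Uint8Array.
const fresh = encoder.encode('avro');

// encodeInto() writes into an existing Uint8Array and reports how many
// UTF-16 code units were read and how many bytes were written.
const target = new Uint8Array(8);
const { read, written } = encoder.encodeInto('avro', target);
```

The "Into" suffix signals that the caller owns the destination, which is the distinction the proposed rename would carry over.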

@valadaptive valadaptive marked this pull request as ready for review September 19, 2024 01:14
Owner

@mtth mtth left a comment

This PR is a great step forward - thank you @valadaptive. The performance improvements are particularly impressive.

I'll follow up with a refactor, renaming methods to make them consistent with the new types.

@mtth mtth merged commit c80c670 into mtth:master Sep 21, 2024
3 checks passed