This is a comparison and benchmark of various binary serialization formats and libraries used in JavaScript as of 2020-07-28.
I was myself trying to decide what binary serialization format I should use in my personal projects, and what started out simple soon turned into a rather extensive comparison.
By sharing my findings, I hope it can be of help (and save time) to someone in a similar situation and perhaps inspire some developers to try out binary serialization.
This article and benchmark attempts to answer what binary serialization library to use with regard to performance, compression size and ease of use.
The following formats and libraries are compared:
- Protocol Buffer:
protobuf-js
,pbf
,protons
,google-protobuf
- Avro:
avsc
- BSON:
bson
- BSER:
bser
- JSBinary:
js-binary
Based on the current benchmark results in this article, the author would rank the top libraries in the following order (higher values are better, measurements are given as x times faster than JSON):
avsc
: 10x encoding, 3-10x decodingjs-binary
: 2x encoding, 2-8x decodingprotobuf-js
: 0.5-1x encoding, 2-6x decoding,pbf
: 1.2x encoding, 1.0x decodingbser
: 0.5x encoding, 0.5x decodingbson
: 0.5x encoding, 0.7x decoding
Ranked by encoded size (compared to JSON) only:
avsc
,js-binary
: 32%:protobuf-js
,pbf
,protons
,google-protobuf
: 42%bser
: 67%bson
: 79%JSON
: 100%
Due to various reasons outlined in the article, the author would not currenly recommend the following libraries:
protons
: 2x encoding, 0.05x decodinggoogle-protobuf
: 0.3-0.5x encoding, 0.8x decoding
Feel free to skip to the Conclusion sections of the article to read the summarized motivation. For performance graphs and detailed measurements skip to Result (final). To reproduce the measurements, skip to Setup.
- Introduction
- Setup
- Disclaimer
- Libraries
- Benchmark
- Result (Protocol Buffers)
- Result (final)
- Result (extra)
- Conclusion
Data serialization is ubiquitous in most areas such as sending and receiving data over the network or storing/reading data from the file system. While JSON is a common modus operandi (especially in JavaScript), using a binary serialization format typically provides an advantage in compression size and performance at the cost of losing human readability of the encoded data.
Two common binary serialization formats across many programming languages are Protocol Buffers and Apache Avro. Avro is inherently a bit more compact than Protobuf, whereas Protobuf uses the additional data as field tags that could make it slightly more forgiving when changing the schema. For those interested, an excellent in-depth explanation has already been written by Martin Kleppmann.
In addition to this, several more recent JavaScript-oriented libraries will be included in the comparison.
This article will mostly focus on the performance aspect and provide a brief overview of each implementation, but as always, there will be pros and cons of each implementation that could be overlooked depending on your use-case.
To reproduce the benchmark results, follow these steps.
- Install Node.js (
12.18.3 LTS
is recommended). - Install dependencies:
npm install
- Use the default configuration or modify
src/config.ts
. - Select what libraries to test by changing
run-tests.sh
or use default that tests all libraries. - Run
run-tests.sh
(if you are on Windows use Git BASH or similar):
cd src
. run-tests.sh
- Create graph images (requires Python with matplotlib installed) by running:
python plot.py
Graph settings can further be configured inside the script. Graph images are by default written to img/
Measurements are accumulated into src/tmp/plot.json
each benchmark iteration. If needed, simply delete the file to reset the graph.
This article only focuses on measurements using JavaScript. Many of the measured formats support additional programming languages that could have different performance characteristics than indicated in this benchmark.
Although outside the scope of this article, compression size (and thereby network transfer speed) can be further improved at the cost of encoding/decoding performance by combining the output with a compressor/decompressor library such google/snappy
or zlib
.
This is the first time I use many of the listed libraries and as such there might still be additional optimizations that I am unaware of. Still, I believe my implementation is a good indication to what most users will end up using.
Feel free to inspect my implementations in src/benchmarks.ts
, and let me know if you find any glaring mistakes (or better yet by submitting a pull request).
The following libraries and versions are tested (sorted by NPM weekly downloads):
bser "6.10.1"
- 7,671kprotobufjs "6.10.1"
- 3,449kbson "4.0.4"
- 1,826kpbf "3.2.1"
- 431kgoogle-protobuf "4.0.0-rc.1"
- 348kavsc "5.4.21"
- 43kprotons "1.2.1"
- 30kavro-js "1.10.0"
- 1.2kjs-binary "1.2.0"
- 0.3k
They are categorized as:
- Protocol Buffer:
protobuf-js
,pbf
,protons
,google-protobuf
- Avro:
avsc
,avro-js
- BSON:
bson
- BSER:
bser
- JSBinary:
js-binary
bser
is a binary serialization library developed for Facebook's "Watchman" filewatcher and is the most downloaded binary serialization library. It is however, mainly used for loca-IPC (inter process communication) as strings are represented as binary with no specific encoding.
google-protobuf
is Google's official Protocol Buffer release, but protobufjs
is by far the more popular library.
avsc
seems to be the most popular Avro library. avro-js
is an offical release by the Apache Foundation but this was excluded from the result section as it seems to be based on an older version of avsc
, contains less features and both libraries yielded very similar benchmark results with a slight advantage to avsc
.
bson
is the official JavaScript BSON library released by MongoDB.
js-binary
is the most obscure library (judging by weekly downloads) and uses a custom binary format that could make interoperability with other programming languages difficult, but this could also make it a good choice due to it being designed with JavaScript in mind.
Each format will be compared against JavaScript's native JSON library as a baseline, regarding compression size and encoding/decoding time.
The data used in the benchmark is a growing array of tuples that is grown in increments of 1/8 of its previous size at each iteration, and (to speed things up) 1/4 when reaching sizes above 10 MB. In a real scenario it could be thought of as a list of vectors in a 2D or 3D space, such as a 3D-model or similarly data intensive object.
To further complicate things the first element of the tuple is an integer. This will give a slight edge to some serialization-formats as an integer can be represented more compact in binary rather than floating-point number.
To all formats that support multiple datatype sizes; integers are encoded as a 32-bit signed integer and decimal numbers are encoded as 64-bit floating-point numbers.
The data is as follows:
[
[1, 4.0040000000000004, 1.0009999999999994],
[2, 4.371033333333333, 0.36703333333333266],
[3, 5.171833333333334, 0.4337666666666671],
[4, 6.473133333333333, 0.36703333333333354],
...
]
The first challenge that arose is that not all measured serialization formats supports root level arrays, and almost no one seems to support tuples, and as such the arrays first need to be mapped to structs as follows:
{
items: [
{x: 1, y: 4.0040000000000004, z: 1.0009999999999994},
{x: 2, y: 4.371033333333333, z: 0.36703333333333266},
{x: 3, y: 5.171833333333334, z: 0.4337666666666671},
{x: 4, y: 6.473133333333333, z: 0.36703333333333354},
...
]
}
This wrapped struct array is the final "payload" that is used in the benchmark unless specified otherwise. This further gives an advantage to some formats over JSON as duplicate information such as field names can be encoded more efficiently in a schema.
It should be noted that the time to convert the unmapped data to the mapped structs is excluded from all measurements in the benchmark.
Each appended element in the growing array is modified slightly so that all elements are unique.
It was discovered that some libraries can considerably impact the performance of other libraries when measured in the same running process, possible due to memory reuse. To prevent this (and to get reproducible results) all measurements in the results section has been measured with each implementation running in an isolated Node.js process.
To reduce unpredictable stalls by the automatic garbage collector in Node.js, the garbage collector is forcefully triggered before each measurement. This did not have any discernable impact on the measured performance other than reducing some randomness.
Compared to the mapped (object with array of structs) data, the unmapped (array of arrays) data is more compact and contains less redundant information. As such additional performance could potentially be gained if the target application uses a similar representation internally.
This is investigated in an additional result section that is found after the main result section.
The benchmark is done in Node.js v12.16.3 on 64-bit Windows 10, with an Intel i7-4790K 4.00GHz CPU and 16 GB RAM.
Protocol Buffer was tested more rigorously than the other formats and is thus given this dedicated section.
An additional format Protobuf (mixed)
is added to the comparison that uses protons
during encoding and protobuf-js
during decoding (this is explained further down).
All protobuf-implementations in the test uses the following proto-file as schema.
syntax = "proto3";
message Item {
int32 x = 1;
double y = 2;
double z = 3;
}
message Items {
repeated Item items = 1;
}
It should be noted that all fields in version "proto3" are optional by default, which could be an advantage feature-wise and disadvantage performance-wise as many of the other measured formats are mandatory by default.
This graph shows the encode/decode time of each implementation in seconds as well as ratio (compared to JSON) given the payload size in MB (measured as JSON). Note that a logaritmic scale is used on the
Encode/Decode time (s)
andJSON size (MB)
axis.
During encoding protons
and Protobuf (mixed)
perfomed the fastest at 2 times faster than native JSON at most payload sizes. protobufjs
and google-protobuf
perfomed the slowest at about 2-3 times slower.
During decoding, protobufjs
, Protobuf (mixed)
performed the fastest at about 5 times faster than native JSON at most payload sizes (although native JSON catches up again at payloads above 200 MB). protons
performed by far the slowest; 20-30 times slower, compared to native JSON.
All implementations (protobuf-js
, pbf
, protons
, google-protobuf
) stayed consistent to the Protocol Buffer format and resulted in an identical compression ratio of 42% (compared to the corresponding file size of JSON) at all measured payload sizes.
This is a ranking of the estimated maximum safe payload limit (measured as JSON) each library was able to process:
pbf
,mixed
: 372 MBJSON
: 298 MBprotobuf-js
: 153 MBgoogle-protobuf
: 98 MBprotons
: 40 MB
When exceeding the payload limit (given the default JavaScript heap size), a heap allocation error occurred in most cases.
JSON | JS | Protons | Pbf | mixed | ||
---|---|---|---|---|---|---|
Prototype pollution | x | x | x | |||
Getters/Setters | x | |||||
Requires unwrapping | x | |||||
Unexpected field renames | x |
This table shows an overview of negative effects during decoding.
pbf
convert cleanly to JavaScript without any detected remnats from the encoding process.
protobuf-js
(which also affects Protobuf (mixed)
) contains serialization remnants (such as type name) hidden in the prototype of the decoded objects, but should behave as a normal plain JavaScript object for most purposes.
protons
should be usable as a data object but wraps all fields into getters/setters that could decrease performance.
google-protobuf
is wrapped in a builder pattern and need to be converted before it can be used and can introduce unexpected field renames. It is however free from metadata after the final conversion.
It is as of now unknown if any of the polluted formats incur an additional overhead to plain objects as this is outside the current scope of this article, but it is something to keep in mind.
protobuf-js
is slow at encoding but fast at decoding.
During encoding it provided mixed result. At sizes below 1 MB it mostly performs better than the native JSON implementation, but at any larger sizes it performs 2 to 3 times worse.
It was the only implementation that reached its max payload limit of 153 MB during encoding, all other formats reached their limit at decoding. It was however discovered that it can decode payloads (created by other implementations) of greater sizes, up to 372 MB.
pbf
is average at encoding and decoding, but is deserialized cleanly without added remnants.
By default pbf
requires an extra build step where boilerplate code is generated from .proto
file, though it seems to offer a few alternatives to streamline the process.
It is also the only Protobuf format that is converted cleanly to JavaScript, which could be a distinct advantage in some cases.
google-protobuf
is slow at encoding, performs average during decoding but might require additional decoding that would further decrease performance, requires extra setup and can cause unexpected renaming of variables.
It does not seem to have an option for deserializing directly form JSON. Instead the following Items
and Item
classes are generated by the protocol buffer compiler that generates a Java-esque builder pattern that the date needs to be mapped into, as outlined here:
...
const ItemsWrap = Schema.Items;
const ItemWrap = Schema.Item;
const itemsWrap = new ItemsWrap();
const itemWraps = data.items.map(item => {
const itemWrap = new ItemWrap();
itemWrap.setX(item.x);
itemWrap.setY(item.y);
itemWrap.setZ(item.z);
return itemWrap;
});
itemsWrap.setItemsList(itemWraps);
return itemsWrap.serializeBinary();
This also unexpectedly renames our array from "items" into "itemsList" which can catch some people of guard and affect the receiving code, as this is not something that is present in other tested Protocol Buffers.
It performs the worst of the implementations during decoding at 2.5 to 3 times slower than native JSON, possible due to the builder overhead.
Deserialization is also misleading. Though it seems to performs only slightly worse than native JSON, the data is still wrapped in the builder object which should be unsuitable for most purposes, and an additional call to ".toObject()" is required to fully convert it back to JSON, which would further decrease the performance and still includes the unexpected name change.
protons
is fast at encoding but very slow at decoding.
The decoded object has all fields wrapped into getters, which might be partially responsible for the poor decoding performance, and while serviceable, could cause some issues depending on how the decoded data is used. The easiest way to remove all getters/setters is to perform a JSON serialization/deserialization which will further increase decoding time.
It was only able to decode payloads of 47 MB in size, but opposite to protobuf-js
it can encode payloads of much greater size.
Protobuf (mixed)
is fast at both encoding and decoding and is good at handling large file sizes.
This implementation is simply a mix of protobuf-js
and protons
, where protons
is used for encoding and protobuf-js
for decoding. This result in the best overall performance of all protobuf implementations and can handle larger payloads than both formats can individually. While this might be too impromptu for most users, it gives us an estimate of how well either of these implementations could perform with some improvements.
Due to poor results and to reduce redundancy, protons
and google-js
will be excluded in further comparisons.
This is the final comparison of the various formats.
This graph shows the encode/decode time of each implementation in seconds as well as ratio (compared to JSON) given the payload size in MB (measured as JSON). Note that a logaritmic scale is used on the
Encode/Decode time (s)
andJSON size (MB)
axis.
During encoding avro-js
performed the fastest of all implementations (with good margin) at 10 times faster than native JSON at most payload sizes, followed by js-binary
and Protobuf (Mixed)
at 2 times faster. Although native JSON once again catches up at payloads above 200 MB (using the default Node.js heap size).
During decoding, avro-js
, protobufjs
, js-binary
, Protobuf (Mixed)
all performed equally well at about 5 times faster than native JSON at most payload sizes. bson
performed the slowest at 1.5 times slower.
This is a ranking of the encoded size (compared to JSON) of each library:
avsc
,js-binary
: 32%:protobuf-js
,pbf
: 42%bser
: 67%bson
: 79%JSON
: 100%
This is a ranking of the estimated maximum safe payload limit (measured as JSON) each library was able to process:
avsc
,jsbin
,pbf
,bser
,Protobuf (mixed)
: 372 MBJSON
: 298 MBprotobuf-js
: 153 MBbson
: 21 MB
BSON | JSBIN | AVRO | BSER | |
---|---|---|---|---|
Prototype pollution | x |
This table shows an overview of negative effects during decoding.
bson
, js-binary
and bser
, all convert cleanly to JavaScript without any detected remnats from the encoding process.
avro-js
, contains serialization remnants (such as type name) hidden in the prototype of the decoded objects, but should behave as a normal plain JavaScript object for most purposes.
avsc
has a more flexible schema definition than most other libraries. Normally an Avro schema is defined in JSON as in this example:
// Type for "float[][]"
const type = avsc.Type.forSchema({
"type": "array",
"items": {
"type": "array",
"items": "float",
}
});
type.schema(); // { "type": "array", ... }
But in avsc
the same schema can be deferred from the data as:
// Type for "float[][]"
const inferredType = avsc.Type.forValue([[0.5]]);
inferredType.schema(); // { "type": "array", ... }
Or as a more complex object:
const AVSC_INT = 0;
const AVSC_FLOAT = 0.5;
const AVSC_DOUBLE = Infinity;
// Type for "{items: { x?: number, y?: number, z?: number}[]}"
const complexType = avsc.Type.forValue({
items: [
{ x: AVSC_INT, y: AVSC_DOUBLE, z: AVSC_DOUBLE },
{ x: null, y: null, z: null },
]
});
Like avsc
it also has a more succinct schema definition than most other implementations.
// Type for double[][]
// Note that all float types in js-binary is 64-bit
const type = new JsBin.Type([['float']]);
// Type for {x?: int, y?: double, z?: double}[]
const complexType = new JsBin.Type({
items: [
{ 'x?': 'int', 'y?': 'float', 'z?': 'float' },
],
});
As mentioned in the Benchmark chapter the original data needed to be simplified to a more verbose format that was supported by all serialization formats. As the unmapped data is more compact and contains less redundant information, additional performance could potentially be gained if the target application uses a similar representation internally. This will be investigated for avsc
, js-binary
, bson
in this section.
As neither formats support tuples, the tuple will be encoded as a 64-bit float array which is expected to increase encoded size ratio slightly as the first field of the tuple can no longer be encoded as a 32-bit integer.
avsc
, js-binary
also has additional settings such as "optional fields", and js-binary
has the datatype JSON that will be investigated with regards to performance.
Performance graph of
JSON
with different settings.
JSON | JSON (unmapped) | |
---|---|---|
Size ratio | 1.00 | 0.77 |
Switching to unmapped data improved both encoding (1/4 faster) and decoding (1/3 faster). It also reduced size ratio to 0.77 of the original JSON.
Performance graph of
avsc
with different settings.
AVRO Ascv | AVRO Ascv (optional) | AVRO Ascv (unmapped) | |
---|---|---|---|
Size ratio | 0.32 | 0.38 | 0.49 |
Payload limit | 372 MB | 372 MB | 237 MB |
Making all fields optional
did decrease performance somewhat from about 10 faster to 4 times faster (though still faster than most other formats). It also increases size ratio slightly.
Switching to unmapped
data also worsened performance similarly. One plausible explanation could be that the tuples are encoded as a dynamically sized array, which would make sense as the schema does not contain any information about the size of the tuple. However, performance is still good compared to other formats. Size ratio was also increased by 63% which is higher than expected as switching from 2 64-bit + 1 32-bit value to 3 64-bit values would only indicate a 20% increase.
Performance graph of
js-binary
with different settings.
JSBIN | JSBIN (optional) | JSBIN (unmapped) | JSBIN JSON (unmapped) | |
---|---|---|---|---|
Size ratio | 0.32 | 0.38 | 0.48 | 0.77 |
Switching to unmapped
had a slight improvement on encoding speed, but apart from this optional
and unmapped
had almost no impact on performance.
Increase in size ratio for both optional
and unmapped
is identical to avsc
.
Encoding all data using the js-binary
datatype json
performed almost identically to JSON (unmapped)
as well as size ratio. This seems to indicate that datatype json
simply consists of using native JSON to create a string that is then stored as datatype string
in js-binary
.
Performance graph of
bson
with different settings.
BSON | BSON (unmapped) | |
---|---|---|
Size ratio | 0.79 | 0.79 |
Unmapped BSON is still slower than JSON in most cases, but it did receive a performance improvement, especially during decoding.
For some unexplained reason the encoded size ratio remained the same for both the mapped and unmapped data. This seems to indicate that BSON can optimize duplicate data, although the encoded size ratio is still relatively large compared to other formats.
The libraries js-binary
, pbf
, bser
and bson
all convert cleanly back to JavaScript without any detected remnants from the encoding process. avro-js
and js-binary
contained minor remnants. google-protobuf
and protons
had major remnants or ramifications.
Overall avsc
performed very well in all measurements, was easy to setup and seems to be the overall best performing serialization library. Switching from mandatory to optional fields slightly worsened performance and compression ratio, but still puts it on top. Performance was also slightly worse when processing small arrays compared to similar sized structs.
js-binary
performed well in all measurements, was easy to setup and is deserialized cleanly. One disadvantage being that it uses a custom binary format that could make interoperability with other programming languages difficult. Switching from mandatory to optional fields had almost no impact on performance.
Regarding Protocol Buffer libraries it should be noted that all fields are optional by default in the latest schema version.
protobuf-js
was slow at encoding but fast at decoding, protons
was fast at encoding but very slow at decoding. However, through a combination of using protons
during encoding and protobuf-js
during decoding, performance on pair with js-binary
(with slightly worse encoding size) could be achieved.
pbf
performed only slightly better than native JSON but is deserialized cleanly without any remnants.
bser
was slower than native JSON at both encoding and decoding (slightly slower than BSON). It was however very good at handling large payloads, and provides a good compression ratio considering it does not require a schema, and is deserialized cleanly.
bson
was slower than native JSON at both encoding and decoding, provided only a modest compression ratio, and was bad at handling large payloads, but does not require a schema and is deserialized cleanly.
google-protobuf
was slow at encoding, performs average during decoding but might require additional decoding that would further decrease performance, requires extra setup and can cause unexpected renaming of variables.