Can we make binary encodings "human-readable"? This post explores this question by means of implementing a library inspired by Erlang's bit syntax.
JSON is probably the most commonly used format for serialising data today. A frequent argument for using it is that JSON is human-readable.
What does that mean exactly? I suppose that people usually mean two things.
First, it's less verbose than XML, making it easier to read. Most people would
probably still call XML human-readable, but arguebly less so than JSON. Second,
it's easier to read than binary encodings produced by MessagePack, ASN.1 or
Protobuf, etc. For example, the JSON string "foo"
is represented by the
following byte sequence when using MessagePack:
+------------ A string of length 3 consisting of ...
| +--------- ... the character 'f', following by ...
| | +--+--- ... two 'o' characters.
| | | |
v v v v
a3 66 6f 6f
If we were to open a file with the above bytes or echo them to the terminal we'd
see £foo
. Which, while one character shorter1 than the JSON string, is
starting to become unreadable already and it will become worse once the JSON
object is more complicated.
It's worth noting that all serialised data ends up being bytes once on disk or sent over the network. So in a sense one could argue that the reason JSON is human-readable, is because these bytes get displayed as ASCII or UTF-8 by our editors and the standard terminal utilities. Another way to think about it is that ASCII and UTF-8 are encodings as well, which would be unreadable without tool support. This isn't a new argument, people like Joe Armstrong and Martin Thompson have separately and on multiple occasions pointed this out. Both stress that we are wasting massive amounts of CPU cycles on parsing JSON.
It's not just that it's less space efficient, as we saw with "foo"
vs £foo
,
it's also because with JSON we need to inspect every single character after the
first "
in order to determine when the string ends, i.e. finding the closing
"
. Whereas in, for example, the MessagePack case the length of the string is
encoded in the a3
byte so we can jump forward and just copy the three bytes
(without looking at them). Joe calls this reconstructing as opposed to
parsing.
So if JSON is merely human-readable because of our application-level tooling, this raises the question: what would it take to make binary encodings "human-readable"?
For starters I think we'd need to make it easier to work with binary data in our programming languages. I believe Erlang's bit syntax, which lets us do bit-level pattern-matching, is a good example of what better language support for working with binary data looks like. Even though Erlang's way ahead most programming languages on this front, there are important use cases which are not possible to express efficiently using bit syntax though, e.g. in-place updates, leaving more to be desired.
In the rest of this post we'll have first have a look at how Erlang's bit syntax works, then we'll turn to its shortcomings and try to start addressing them by means of implementing a library.
Erlang has a feature called bit syntax which allows the user to encode and decode data at the bit-level. Here's an example, where we encode three integers into two bytes:
1> Red = 2.
2
2> Green = 61.
61
3> Blue = 20.
20
4> Mem = <<Red:5, Green:6, Blue:5>>.
<<23,180>>
Normally, even the smallest integer takes up one byte (e.g. char
in C or
Int8
in Haskell) but Erlang's bit syntax lets us encode, e.g., Red
using
only 5 bits (rather than the default 8 bits) and thus we can fit all three
integers in
We can also pattern match at the bit-level using sizes to get our integers back:
5> <<R1:5, G1:6, B1:5>> = Mem.
<<23,180>>
6> R1.
2
7> G1.
61
8> B1.
20
For larger integer types, e.g. 0x12345678 :: Int32
, we can also specify the
byte order or endianness:
1> {<<16#12345678:32/big>>,<<16#12345678:32/little>>,
<<16#12345678:32/native>>,<<16#12345678:32>>}.
{<<18,52,86,120>>,<<120,86,52,18>>,
<<120,86,52,18>>,<<18,52,86,120>>}
For a slightly larger example, here's pattern-matching on an IP datagram of IP protocol version 4:
-define(IP_VERSION, 4).
-define(IP_MIN_HDR_LEN, 5).
...
DgramSize = byte_size(Dgram),
case Dgram of
<<?IP_VERSION:4, HLen:4, SrvcType:8, TotLen:16,
ID:16, Flags:3, FragOff:13,
TTL:8, Proto:8, HdrChkSum:16,
SrcIP:32,
DestIP:32, RestDgram/binary>> when HLen >= 5, 4*HLen =< DgramSize ->
OptsLen = 4*(HLen - ?IP_MIN_HDR_LEN),
<<Opts:OptsLen/binary,Data/binary>> = RestDgram,
...
Note how we can match on the header length, HLen
, and later use the value of
that match as the size when pattern matching on later values.
We can implement a library that lets us do similar things to Erlang's bit syntax, but in a more clunky way (it's difficult to beat native syntax support).
import BitsAndBobs
let
pattern0 = sized word32 5 ::: sized word32 6 ::: sized word32 5 ::: Nil
bytestring0 = byteString [2 :. sized word32 5, 61 :. sized word32 6, 20 :. sized word32 5]
bitMatch pattern0 bytestring0
-- => (2,(61,(20,())))
let
pattern1 = word8 :>>= \sz -> sized bytes sz ::: bytes ::: Nil
bytestring1 = byteString [5 :. word8, "hello, rest" :. bytes]
bitMatch pattern1 bytestring1
-- => (5,("hello",(", rest",())))
The above is implemented in Haskell, but should be straightforward to port to most languages using the following recipe.
The high-level idea when encoding a bunch of, possibly sized, values into a
ByteString
is as follows:
- For each value convert the value into a list of booleans (or bits);
- If the value is sized then only take that many bits, otherwise if it isn't
sized use the default value, e.g.
Int8
= 8 bits,Float
= 32 bits, etc; - Concatenate the lists of booleans for each value into a single list of booleans;
- Split the list in groups of 8 bits;
- Convert each 8 bits into a byte (
UInt8
); - Create a
ByteString
from list ofUInt8
s.
For decoding or pattern-matching a, possibly sized, pattern against a
ByteString
the idea is:
- Convert
ByteString
into list of booleans (or bits); - For each pattern take its size many bits from the list;
- Convert the bits into the value type of the pattern;
- Continue matching the remaining patterns against the remaining bits.
Float
and Double
s get converted into UInt32
and UInt64
respectively
before converted into bits, and Int
egers are encoding using
zigzag
encoding2.
Erlang's bit syntax makes it possible to decode binary data into the host languague's types, which can then be manipulated, and finally encoded back to binary.
While already useful, it doesn't cover some interesting use cases. Let me try to explain the use cases and at the same time sketch possible ways we can extend Erlang's bit syntax to cover those.
What if we merely want to update some binary in-place without reading it all in and writing it all back out?
For example, the de facto standard for metadata format for mp3 files is called ID3. This was never part of the mp3 specification, but rather added afterwards and so in order to not break backwards-compatibility with old media players they added it at the end of the file.
Lets imagine we wanted to write a metadata editor for mp3 files using Erlang's bit syntax. I think no matter how smart the Erlang run-time is about bit syntax, it's hard to imagine that it wouldn't need to deserialse and serialise more data than necessary. Worst case it would deserialise all of the audio that leads up to where the metadata starts, but even if it's somehow clever and starts from the back then we'd still probably need to at least deserialise all fields preceding the field we want to update.
Inspired by this problem and how tools like poke
work, I've started another experiment based on Schema
s with this use case in
mind, here's an example session of editing the metadata of an mp3 file:
$ cabal run mp3 -- /tmp/test.mp3
mp3> help
schema | read <field> | write <field> <value> | list | q(uit)
mp3> schema
audio : Binary
header : Magic "TAG"
title : ByteString (Fixed 30)
artist : ByteString (Fixed 30)
album : ByteString (Fixed 30)
year : ByteString (Fixed 4)
comment : ByteString (Fixed 30)
genre : UInt8
mp3> read title
Unknown
mp3> write title "Bits and Bobs"
mp3> read title
Bits and Bobs
mp3> list
Right (Id3V1 {title = "Bits and Bobs", artist = "", album = "", year = "2023", comment = ""})
mp3> quit
The user needs to specify the Schema
, which is closely mapped to the ID3v1
specficiation and the rest is provided by the library. In particular all the
offsets to the different fields are calculated from the schema3, which allow us
to jump straight to the field of interest and reconstruct it without parsing.
The above interactive editor is completely
generic and works for any Schema
!
If we can read and update fields, it should also be possible to get diffs and patches for cheap.
Now that we can edit files in-place on the disk it would be nice to use this in order to implement on-disk data structures. For example imagine we'd like to do some kind of logging. If our schemas could express arrays and records we could define our log to be an a struct with a length field and an array of records field that of size length. In addition to extending the schema with arrays and records, we'd also need atomic increments of the length field so that we can in a thread-safe manner allocate space in our array. B-trees or Aeron's log buffers would be other interesting on-disk data structures to implement.
The generic editor would be useful for debugging and manipulating such data
structures, but we'd probably want more tooling. For logging we probably want
something like cat
and grep
but generic in Schema
.
When we read
a ByteString
field in the mp3 metadata example above, we copied
the bytes from the underlying file. Sometimes we might want to avoid doing that.
For example imagine we are implementing some network protocol. We can use a
pre-allocated buffer and recv
bytes from a
socket into this buffer (avoiding allocating memory while handling requests),
once the request is inside our buffer we can decode individual fields (without
parsing) and from that we can determine what kind of request it is. Let's
imagine it's some kind of write request where we want to save the payload of
some field to disk. It would be a waste to copy the bytestring of the payload
only to write it disk immediately after, since the network request consists of
raw bytes and that's what we want to write to the disk anyway. Instead we'd like
to be able to decode the payload field as a pointer/slice of the buffer which we
pass to write
(thus avoiding copying aka
"zero-copy").
Another big topic is schema evolution. How can we maintain backward- and forward-compatibility as our software evolves? We probably want to be able to migrate old formats into newer ones somehow also.
Currently our schemas cannot express how to compress fields on disk, or how to avoid sending unnecessary data in consecutive network messages.
An example of the former might be to compress a bytestring field, using say deflate, before writing it to disk. While an example of the former might be to only send the difference or change of some integer field, instead of sending the whole integer again. To make things more concrete, lets say the integer represents epoch time and we send messages several times per second, then by only sending the difference or delta in time since the last message we can save space. Other examples of compression include dictionary compression, run-length encoding, bit packing and Huffman coding.
It would be neat if encoding and decoding fields could be done modulo
compression! Likewise the schema-based cat
and grep
could also work modulo
compression.
A related topic is storing our data in a row-based or columnar fashion. Take the example of a logging library we discussed earlier with a schema that's an array of records, i.e. each log call adds a new record to the array. This is nice in terms of writing efficiency, but if we wanted to do a lot of grepping or some aggregation on some field in the record then we'd have to jump around a lot in the file (jumping over the other fields that we are not interested in). It could be more efficient to restructure our data into a record of arrays instead, where each array only has data from one field, that way searching or aggregating over that field would be much more efficient (no jumping around). Some compression is also a lot easier to apply on columnar data, e.g. delta and run-length encoding. Perhaps it would make sense if the schema-based tools could do such data transformations in order to optimise for reads or archiving?
If we can do encoding and decoding fields modulo compression, why not also handle checksums transparently? When we update a field which is part of a checksum, we'd probably want to check the checksum beforehand and recompute it afterwards.
What if some input bytes don't match the schema? Currently all magic tags in a schema get verified, but sometimes we might want to be able to edit incomplete or malformed inputs.
Can we add refinements to the schema which allow us to express things like, integer between 18 and 150 or bytestring containing only alphanumeric characters, etc?
So far we've looked at how to specify what data our programs use and how it's transformed to and from bytes on disk or over the network. Another important aspect is what protocol is followed when said data is sent between components in the system.
For example consider some client-server application where our schema describes the request and responses:
flowchart LR
Client -- request --> Server
Server -- response --> Client
The schema doesn't say anything about in which order requests are legal. For
example, we might want to always requrie a login-like request at the start of a
session. Or let's say we are describing a POSIX-like filesystem API, then
read
s and write
s must only be made on open
(and not yet close
d) file
descriptors.
Joe Armstrong wrote a paper called Getting Erlang to talk to the outside world (2002) which discusses this problem. He proposed a language for describing protocols and a dynamic sessions type checker, it never seemed to have got much traction though even though he gave several talks about it. One implementation can be found here.
There's this neat tool called pandoc
that
makes possible to convert between different text formats, e.g. from Markdown to
HTML.
The list of supported formats to convert from and to is pretty long. If we were
to convert to and from each pair of possibilities would require pandoc
does instead is to
convert each format to and from its internal abstract representation, thereby
reducing the problem to
Could we do something similar for binary encodings?
In the book Development and Deployment of Multiplayer Online Games, Vol. I by
Sergey Ignatchenko (pp. 259-285, 2017) the author talks about how most
IDLs, e.g.
Protobufs, have the same language for describing what the abstract data which
we want to serialise and how we actually want the data to be serialised. By
separating the two, we could change the binary format "on the wire" without
changing the application which operates on the abstract data (the what part).
A clearer separation between IDL and its encoding could perhaps be useful when
trying to solve the pandoc
problem for binary.
Another way to think of this is: can we make a DSL for IDLs?
-
Q: Why not just use Protobuf?
A: Except for backward- and forward-compatibility, I don't think Protobufs can handle any of the above listed use cases. Also the way it handles compatibility with it's numbered and optional fields is quite ugly4.
-
Q: Writing safely to disk without going via a database is almost impossible!?
A: Dan Luu has written about this on several occasions. Short answer: don't store anything you are worried about losing using this library. Longer answer: I'd like to revisit this topic from the point of view of testing at some later point in time. In particular I'm interested in how we can make the results from the paper All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications (2014) more accessible, especially their tool ALICE: Application-Level Intelligent Crash Explorer.
The current implementation is in Haskell, but I'd really like to encourage a discussion beyond specific languages. In order to make binary "human-readable" we need solutions that are universal, i.e. work in any language, or perhaps better yet than libraries -- extend programming languages with something like Erlang's bit syntax.
- Do you have use cases that are not listed above?
- Do you know of tools, libraries or solutions any of the above use cases that have already not been discussed or are not listed below in the "see also" section?
- Do you know if some use cases impossible in general or incompatible with each other?
- Interested in porting any of these ideas to your favorite language?
If so, feel free to get in touch!
- The Erlang reference manual on bit syntax;
- Programming examples of bit syntax from the Erlang user's guide;
- Joe Armstrong's PhD thesis (2003) also has a section on bit syntax on p. 60;
- Native Code Compilation of Erlang's Bit Syntax (2002);
- Cap’n Proto;
- Simple Binary Encoding (SBE) by Martin Thompson et al;
- GNU poke, extensible editor for structured binary data;
- fq: jq for binary formats also described in this talk;
- Terminal JSON viewer;
- Rust's binrw crate;
- Designing Data-Intensive Applications by Martin Kleppmann (chapter 3-4, 2017);
- Development and Deployment of Multiplayer Online Games, Vol. I by Sergey Ignatchenko (pp. 200-216 and 259-285, 2017).
Footnotes
-
The savings are greater for more complicated JSON objects, especially considering JSON doesn't support binary data which needs to be either escaped or base64 encoded before used as a string. ↩
-
I don't think Erlang uses zig-zag encoding of integers, in fact I'm not sure what it does with them. ↩
-
The library tries to calculate the offset of a field from the start of the file, in this case the beginning of the file contains an audio binary "field" of unknown length, so it fails and retries calculating the offset from the end of the file instead. ↩
-
Avro has a nicer story for compatibility. ↩