ASDF post #259

Merged
merged 6 commits into from
Nov 28, 2018
250 changes: 250 additions & 0 deletions content/post/asdf.md
@@ -0,0 +1,250 @@
---
author: vadim
date: 2018-11-26
title: "Why we chose Advanced Scientific Data Format for ML models"
image: /post/asdf/logo.png
description: "What ASDF is, why it is awesome, why you should probably use it, and how. Why we adopted ASDF in source{d} ML projects."
categories: ["technical"]
---

At source{d} we are developing a project named [Modelforge](https://github.com/src-d/modelforge).
Its goal is to hide the serialization and the retrieval of machine learning
models from its users. That is, it solves the two major problems that
every ML engineer hits:

1. How to save a trained model and load it back.
2. How to reference and distribute trained models.

Let's not go off-topic by discussing (2): it has many fun subproblems
which deserve a dedicated blog post about Modelforge. Instead, the focus here is on
serialization. The Python crowd is used to solving (1) with
[`pickle`](https://docs.python.org/3/library/pickle.html), the built-in
serialization module. It works nicely as long as these three conditions hold:

1. The origin of the files is trusted and secure.
2. The interoperability with other languages is not needed.
3. The serialized objects are not big.

As some know very well, `pickle` contains a full-featured virtual machine inside,
which interprets the stream to recreate the serialized Python objects. The advantage
is that we can serialize almost any object, except for a few rare types
like mutexes or executable code. The implied drawbacks are substantial though:
we should not load a pickle from an untrusted source because loading it can
execute arbitrary code; the loader must implement a virtual machine,
which is hard in other programming languages; and the memory consumption
during serialization and deserialization can grow far beyond the original object size.
The last drawback matters for machine learning and data science in particular:
the objects can get very big, say, more than a few hundred megabytes, and
pickling them is very slow and memory-hungry.
We have seen `pickle`-ing consume 2 and even 3 times the size of the object,
which led to fun situations where the resulting data was successfully computed
but the whole effort went to waste because
there was not enough RAM for the pickling itself.
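
To make the "virtual machine" point concrete, here is a minimal sketch using only the standard library. The `Innocent` class is hypothetical and deliberately harmless: it asks the loader to evaluate `6 * 7`, where a malicious pickle could ask for anything else.

```python
import pickle
import pickletools


class Innocent:
    """Hypothetical class: its pickle asks the loader to call eval()."""

    def __reduce__(self):
        # pickle records "call this callable with these arguments";
        # the loader executes the call without ever constructing Innocent.
        return (eval, ("6 * 7",))


payload = pickle.dumps(Innocent())

# The payload is a small program for the pickle virtual machine.
opcodes = [opcode.name for opcode, _arg, _pos in pickletools.genops(payload)]
print(opcodes)

# Loading runs the REDUCE opcode, which evaluates the embedded expression.
result = pickle.loads(payload)
print(result)
```

The second `print` shows `42`: the loader ran code chosen by whoever produced the bytes, which is exactly why condition (1) above matters.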

So when we started Modelforge, we searched for the best serialization format
which is **not** `pickle`.
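
The memory overhead described above is easy to reproduce on a small scale. The sketch below is only an illustration with a 5 MB stand-in object, not a benchmark of our models: `tracemalloc` measures what `pickle.dumps` allocates on top of the object being saved.

```python
import pickle
import tracemalloc

# A stand-in for a large model: 5 MB of raw bytes, allocated before tracing.
data = bytes(5_000_000)

tracemalloc.start()
blob = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)
_current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# Everything traced here was allocated on top of the original object.
print(f"object: {len(data)} bytes, extra peak while pickling: {peak} bytes")
```

Even in this best case the process holds at least two full copies at once, the object and its serialized form, before a single byte reaches the disk.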

{{% caption src="/post/asdf/pickle_rick.png" %}}
"I am a pickle, I get out of serializing huge tensors."
{{% /caption %}}

## Life beyond `pickle`
I keep on thinking of pickle-rick!

@Guillemdb you need to see this 🤣

hahahahaha Pickle rick seal of approval!

Btw, have you already figured out how to avoid the lazy loading of asdf files?

I did in asdf-format/asdf#573 Waiting impatiently for 2.2.0 release.

YAAAY! now I can use ASDF to feed data to RL models! @vmarkovtsev you are awesome!

Sorry @vmarkovtsev! I hope I can get a release out next week. Feel free to ping me if you haven't heard anything by Weds.


Our best-shot requirements were:

1. Binary format. It is impossible to efficiently save a huge dense tensor in JSON or YAML.
2. At the same time, it would be nice to be able to save any JSON-like metadata without pain.
3. The schema should be optional. Formats that require the schema in
a separate file make introspection harder and force ML researchers to
maintain an extra piece of code. Pythonistas expect introspection, so for
an idiomatic Python library this is essential.
4. The typical data size may reach tens of gigabytes. At that scale
the overhead of embedding a schema is negligible, hence the files
can be self-descriptive and the schema turns into a validation perk.
This requirement rules out Protocol Buffers completely, because each PB file
is always fully loaded into memory in all the existing implementations.
The same requirement excludes popular binary serialization formats which target
messaging and RPC and are not optimized for large messages with few items.
5. Python should be a first-class citizen. numpy arrays should be serializable
without any additional code.
6. Yet the format should not require Python. The user-facing applications at source{d}
are written in Go, so a Python-only format would be hard to integrate.
7. On-the-fly compression. NLP models often contain strings which can require much
space while being perfectly compressible. Integers can be compressed too since
we sometimes don't know their range beforehand and use 32 bits while only 16 are
really needed.
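
Requirements (1) and (7) can be checked with a few lines of standard-library Python. This is a rough sketch on random data, so the exact numbers are not meaningful, only the direction: text encoding inflates floats, and generic compression recovers the bits wasted by an oversized integer type.

```python
import array
import json
import random
import zlib

rng = random.Random(0)

# Requirement 1: the same 100,000 float64 values as raw bytes and as JSON text.
floats = array.array("d", (rng.random() for _ in range(100_000)))
raw_floats = floats.tobytes()
json_floats = json.dumps(floats.tolist()).encode("utf-8")
print(f"binary: {len(raw_floats)} bytes, JSON: {len(json_floats)} bytes")

# Requirement 7: values that fit in 16 bits but are stored in a wider C int.
ints = array.array("i", (rng.randrange(2 ** 16) for _ in range(100_000)))
raw_ints = ints.tobytes()
packed_ints = zlib.compress(raw_ints)
print(f"int{8 * ints.itemsize}: {len(raw_ints)} bytes, zlib: {len(packed_ints)} bytes")
```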

[HDF5](https://support.hdfgroup.org/HDF5/) came closest to these requirements. It is binary,
it needs no schema, it supports big tensors, the Python bindings are mature and
well-integrated, and there are bindings for other languages. HDF5 is used in e.g. [Keras](https://github.com/keras-team/keras).
However, it is not ideal in terms of performance and lacks some modern features;
see ["Moving away from HDF5"](https://cyrille.rossant.net/moving-away-hdf5/).

We previously had some positive experience with SQLite + SQLAlchemy on top,
but of course that variant cannot cope with big data blobs (4).
There is a common workaround for (4): it is possible to store huge tensors as external files.
This implies concatenating all the blocks together before uploading and splitting
them back after downloading, e.g. as a TAR or a ZIP without compression.
Those meta-archives are typically used as a temporary transfer medium
(e.g., [TensorFlow Hub](https://www.tensorflow.org/hub/hosting)).
In turn, this means higher usage complexity, a doubled requirement for free disk space,
increased vulnerability to data corruption and an explosion in the number
of open file descriptors.
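
For reference, the external-files workaround boils down to something like the sketch below. The file names and payloads are made up for illustration; the point is that every tensor exists twice, once as a separate file and once inside the archive.

```python
import io
import tarfile

# Hypothetical external blobs that accompany a small metadata database.
blocks = {"weights.bin": bytes(1024), "vocab.bin": b"token " * 100}

# Pack: concatenate the blobs into one uncompressed TAR before uploading.
archive = io.BytesIO()
with tarfile.open(fileobj=archive, mode="w") as tar:  # "w" means no compression
    for name, payload in blocks.items():
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

# Unpack: split the archive back into separate blobs after downloading.
archive.seek(0)
with tarfile.open(fileobj=archive, mode="r") as tar:
    restored = {m.name: tar.extractfile(m).read() for m in tar.getmembers()}

print(restored == blocks)
print(len(archive.getvalue()), sum(len(p) for p in blocks.values()))
```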

No more intrigue: we discovered ASDF.

## ASDF

[Advanced Scientific Data Format](https://github.com/spacetelescope/asdf) (ASDF)
is a next generation serialization format for scientific data. This means that
it focuses on storing sparse and dense tensors in an efficient way.
The ASDF project was started by [Michael Droettboom](https://github.com/mdboom) (Matplotlib; astropy)
at the Space Telescope Science Institute in 2014. ASDF is not implementation-driven: it is based on
a well-defined [standard](https://asdf-standard.readthedocs.io/en/latest/).
However, there is only one maintained software library, and it is written in Python.
While historically ASDF targeted astronomers, it is actually abstracted away
from any scientific domain and is completely versatile. ASDF features:

* Transparent, automatic, on-the-fly compression and decompression with zlib, bzip2 or lz4.
* YAML header with binary blocks appended to the tail. All the perks of using YAML are preserved.
* Uncompressed tensors can be [memory mapped](https://en.wikipedia.org/wiki/Memory-mapped_file), so that memory consumption stays very low, with a tiny performance penalty for sequential reads and writes. A killer feature if your tensors are big.
* Data structure can be validated with YAML schemas.
* Python and numpy arrays are first-class citizens.
* The tagging mechanism makes it easy to add new binary data types, though this is rarely needed in practice.
* The schema is there but is completely optional.
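
Memory mapping is the least obvious feature in this list, so here is a sketch of the idea with the standard library alone. It does not touch ASDF: it writes a plain file of little-endian int32 values and then reads a single element through `mmap` without loading the rest.

```python
import mmap
import os
import struct
import tempfile

count = 100_000
path = os.path.join(tempfile.mkdtemp(), "block.bin")

# Write an uncompressed little-endian int32 block: 0, 1, 2, ...
with open(path, "wb") as f:
    f.write(struct.pack(f"<{count}i", *range(count)))

# Map the file instead of reading it; pages are loaded only when accessed.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
        (item,) = struct.unpack_from("<i", view, 12_345 * 4)

print(item)
```

The same trick cannot work on a compressed block, because the bytes on disk no longer match the array layout in memory. That is why the feature is limited to uncompressed tensors.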

## Code examples

By default, ASDF reads tensors from disk lazily upon the first reference.
Thus opening an ASDF file is very fast, and it is easy to introspect it quickly.
The contents are always a Python `dict` and they are placed into the `tree` attribute.

```python
import asdf

# Arrays are loaded lazily, so opening the file and printing the tree is cheap.
with asdf.open("file.asdf") as f:
    print(f.tree)
```

Let's create a new ASDF file and see what's inside.

```python
import io

import asdf
import numpy

tensor = numpy.ones(10, dtype="int32")
buffer = io.BytesIO()
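asdf.AsdfFile(tree={
    "tensor1": tensor,  # the same array object is referenced twice
    "tensor2": tensor,
    "meta": "data",
    "int": 100,
}).write_to(buffer)
# The file mixes YAML text with binary blocks, so escape the non-UTF-8 bytes.
print(buffer.getvalue().decode("utf-8", errors="backslashreplace"))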
```

We should see

```
#ASDF 1.0.0
#ASDF_STANDARD 1.2.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.1.0
asdf_library: !core/software-1.0.0 {author: Space Telescope Science Institute, homepage: 'http://github.com/spacetelescope/asdf',
  name: asdf, version: 2.1.0}
history:
  extensions:
  - !core/extension_metadata-1.0.0
    extension_class: asdf.extension.BuiltinExtension
    software: {name: asdf, version: 2.1.0}
int: 100
meta: data
tensor1: !core/ndarray-1.0.0
  source: 0
  datatype: int32
  byteorder: little
  shape: [10]
tensor2: !core/ndarray-1.0.0
  source: 0
  datatype: int32
  byteorder: little
  shape: [10]
...
\xd3BLK0(((\x9dZ\x82\xde\xf2_\x8f\x83<B\xaa\xa4g\xde#ASDF BLOCK INDEX
%YAML 1.1
--- [615]
...
```

Tree items which are not binary, "meta" and "int", have been written inline in YAML.
ASDF was smart enough to serialize only one copy of the array: both `tensor1`
and `tensor2` point to the same `source: 0`.
The array is placed at the end of the file and forms a "block". `!core/ndarray-1.0.0`
is the tag which identifies the built-in tensor data type. The "source" field
references the block containing the array elements.
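
How can a writer notice that two tree items are the same array? The toy function below shows one plausible way, deduplicating by object identity. It is an illustration only and not the code of the `asdf` library: `collect_blocks` and its header layout are invented for this sketch, and a `bytearray` stands in for a numpy array.

```python
def collect_blocks(tree):
    """Toy illustration: assign one block per distinct bytes-like object."""
    blocks = []        # payloads in the order they would be written
    index_by_id = {}   # id(obj) -> position in `blocks`
    header = {}
    for key, value in tree.items():
        if isinstance(value, (bytes, bytearray, memoryview)):
            if id(value) not in index_by_id:
                index_by_id[id(value)] = len(blocks)
                blocks.append(value)
            # Binary items are replaced by a reference to their block.
            header[key] = {"source": index_by_id[id(value)]}
        else:
            header[key] = value  # plain metadata stays inline
    return header, blocks


tensor = bytearray(40)  # stand-in for a 10-element int32 array
header, blocks = collect_blocks(
    {"tensor1": tensor, "tensor2": tensor, "meta": "data", "int": 100}
)
print(header)
print(len(blocks))
```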

It's also very cool that the table of block offsets (the block index) is placed at the bottom
of the file, so that it is possible to append without rewriting the whole file.
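
Why does a trailing index make appending cheap? The sketch below uses a made-up toy layout, not the real ASDF byte format: the marker, the 8-byte length prefix and the JSON index are all assumptions chosen to keep the example short. Appending only overwrites the old index; the existing blocks are never moved.

```python
import io
import json
import struct

MARKER = b"#TOY BLOCK INDEX\n"  # hypothetical marker, not the ASDF one
TAIL = 4096                     # assume the index always fits in the last 4 KB


def write_index(stream, offsets):
    stream.write(MARKER + json.dumps(offsets).encode("ascii"))


def write_blocks(stream, blocks):
    """Write length-prefixed blocks back to back, then the offsets table."""
    offsets = []
    for payload in blocks:
        offsets.append(stream.tell())
        stream.write(struct.pack("<Q", len(payload)) + payload)
    write_index(stream, offsets)


def read_index(stream):
    """Find the index by reading only the tail of the file."""
    size = stream.seek(0, io.SEEK_END)
    tail_start = max(0, size - TAIL)
    stream.seek(tail_start)
    tail = stream.read()
    position = tail.rindex(MARKER)
    return tail_start + position, json.loads(tail[position + len(MARKER):])


def append_block(stream, payload):
    """Drop the old index, add one block in its place, write a longer index."""
    index_start, offsets = read_index(stream)
    stream.seek(index_start)
    stream.truncate()
    offsets.append(stream.tell())
    stream.write(struct.pack("<Q", len(payload)) + payload)
    write_index(stream, offsets)


buffer = io.BytesIO()
write_blocks(buffer, [b"first", b"second"])
append_block(buffer, b"third")
_, offsets = read_index(buffer)
print(offsets)
```

With the index at the top of the file instead, every append would shift all the blocks that follow it.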

Let's try with compression now.

```python
import io

import asdf
import numpy

tensor = numpy.zeros(1000000, dtype="int32")
buffer = io.BytesIO()
asdf.AsdfFile(tree={
    "tensor": tensor,
}).write_to(buffer, all_array_compression="zlib")
print(buffer.getvalue().decode("utf-8", errors="backslashreplace"))
```

The zlib algorithm is very efficient at compressing zeros, so the result is expected and awesome:

```
#ASDF 1.0.0
#ASDF_STANDARD 1.2.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.1.0
asdf_library: !core/software-1.0.0 {author: Space Telescope Science Institute, homepage: 'http://github.com/spacetelescope/asdf',
  name: asdf, version: 2.1.0}
history:
  extensions:
  - !core/extension_metadata-1.0.0
    extension_class: asdf.extension.BuiltinExtension
    software: {name: asdf, version: 2.1.0}
tensor: !core/ndarray-1.0.0
  source: 0
  datatype: int32
  byteorder: little
  shape: [1000000]
...
 \xf7Om\xf0n::= \x90\xb7A\xf9\x84\xe3\xbb@\x8d\x92\xfaI\xf9\xc46x\x9c\xed\xc1
\x93#ASDF BLOCK INDEX
%YAML 1.1
--- [506]
...
```
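
To get a feeling for the numbers without going through ASDF, the same payload can be fed to the compressors directly. This sketch uses the standard-library `zlib` and `bz2` modules on 4,000,000 zero bytes, the raw size of one million `int32` zeros; lz4 is left out because it is not in the standard library.

```python
import bz2
import zlib

raw = bytes(4_000_000)  # same size as one million int32 zeros

sizes = {}
for name, compress, decompress in (
    ("zlib", zlib.compress, zlib.decompress),
    ("bzip2", bz2.compress, bz2.decompress),
):
    packed = compress(raw)
    assert decompress(packed) == raw  # the round trip is lossless
    sizes[name] = len(packed)
    print(f"{name}: {len(raw)} -> {len(packed)} bytes")
```

Real tensors are rarely this friendly, so it is worth measuring on your own data before paying the CPU cost, and keeping in mind that compressed blocks cannot be memory mapped.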

## Summary

[ASDF](https://github.com/spacetelescope/asdf) is a relatively new hybrid format: YAML plus binary blocks.
It is implemented only for Python at the moment, and I really hope that
the community will help with covering other programming languages (Go please!).
ASDF is very nice for storing scientific data such as tensors and is well suited
to serializing machine learning models. source{d} successfully uses it in
[Modelforge](https://github.com/src-d/modelforge),
a framework to serialize and distribute MLonCode models.

Don't want to miss the next blog post about how the source{d} ML team does R&D?
Subscribe to [our newsletter](http://go.sourced.tech/newsletter), follow
[@sourcedtech](https://twitter.com/sourcedtech) on Twitter and don't forget
about our [Paper Reading Club](https://github.com/src-d/reading-club).
Oh, and we are organizing the
[MLonCode developer room at FOSDEM'2019](https://medium.com/sourcedtech/ml-on-code-devroom-cfp-fosdem-2019-4f867f128e21#a948) - the call for proposals is open!
Binary file added static/post/asdf/logo.png
Binary file added static/post/asdf/pickle_rick.png