IPTF brings the best parts of IPFS into the world of TensorFlow. If you're not familiar with IPFS, think of it as a cross between git and BitTorrent: a global peer-to-peer network of machines that provide instant, reliable access to massive datasets.
IPTF is free, it’s peer-to-peer, and it requires zero configuration. Use IPTF in your machine learning workflow to handle enormous quantities of data without needing to set up, configure, or orchestrate a separate storage solution.
By embedding an IPFS daemon directly into TensorFlow, IPTF provides seamless read/write access to the entire IPFS network. IPTF is fully integrated with `tf.gfile` and all of the io_ops, so `iptf://`-prefixed paths can be used in Python directly or with any core file operation in a TensorFlow graph.
Instead of downloading each of the MNIST files and storing them somewhere on your machine, just use the path `iptf://ipfs/QmazxWBPrSfTzkuzQNvyzYWx438q98Q1pQ7vRJkQZh7x97`, and TensorFlow will fetch and cache the data you need on the fly.
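For example, here's a minimal sketch of both access paths. (Assumptions: importing the `iptf` Python package is what registers the `iptf://` scheme, and the dataset directory uses the standard MNIST filenames.)

```python
import tensorflow as tf
import iptf  # assumption: importing this package registers the iptf:// file system

mnist = "iptf://ipfs/QmazxWBPrSfTzkuzQNvyzYWx438q98Q1pQ7vRJkQZh7x97"

# tf.gfile treats an iptf:// path like any local path.
print(tf.gfile.ListDirectory(mnist))

# The core io_ops work too, so the bytes can be read inside a graph.
images = tf.read_file(mnist + "/train-images-idx3-ubyte.gz")  # filename assumed
with tf.Session() as sess:
    print("fetched %d bytes" % len(sess.run(images)))
```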
The `QmazxWBPrSfTzkuzQNvyzYWx438q98Q1pQ7vRJkQZh7x97` part of the path is the fingerprint of the MNIST dataset. The only file in the world that has that fingerprint is the MNIST dataset. So not only is it enough information to request that file from the network, it's also enough information to check that you received a perfect, uncorrupted copy of MNIST. Similar to BitTorrent, IPTF downloads pieces of your files from many computers in parallel and still guarantees you're getting exactly the right sequence of bytes.
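For a sense of where fingerprints come from: the `ipfs add` command produces them, and the CID it prints for a directory is exactly what goes into an `iptf://` path (hashes elided and output abbreviated below):

```
$ ipfs add -r mnist/
added Qm... mnist/train-images-idx3-ubyte.gz
added Qm... mnist
```

The last line's CID names the whole directory: anyone holding those blocks can serve them, and anyone fetching them can verify every byte against the hash.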
Addressing files and directories by their content helps optimize bandwidth usage as well. When peers on your local network already have the blocks you need cached, IPTF fetches them locally, saving your Internet bandwidth for data that hasn't been downloaded yet.
Datasets aren't the only large files in machine learning. Model weights get large as well. You can use IPTF to load a model like Inception in a single line of code.
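A sketch of what that one line could look like (the CID and filename here are placeholders, not a published fingerprint):

```python
import tensorflow as tf
import iptf  # assumption: importing this package registers the iptf:// scheme

# Load a frozen Inception GraphDef straight from IPFS.
graph_def = tf.GraphDef.FromString(
    tf.gfile.GFile("iptf://ipfs/<inception-cid>/inception_v3.pb", "rb").read())
tf.import_graph_def(graph_def, name="inception")
```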
Reproducibility is a key aspect of machine learning. IPTF provides a simple way to reproduce any training result at any time. When IPTF loads data to feed to a model, it remembers the fingerprint of the data in a way that's easy to refer to later, similar to git’s reflog. This makes it easy to reuse any data you've used in the past to reproduce a result or compare a new model to an old one.
Using IPTF during your project’s exploration phase has a bonus benefit: storage efficiency. Whenever new data is stored in IPTF, only the blocks that differ from existing files will be stored.
Run the Python smoke tests:

```
$ bazel run //iptf/python:file_system_test
```

Run the C++ `tensorflow::FileSystem` tests:

```
$ bazel run //iptf/cpp:file_system_test
```

To experiment with IPTF-enabled TensorFlow on an existing project while still building from source:

```
$ bazel run //iptf/python:iptf -- python foo.py bar
```
- Walkthrough of how to use IPTF in a real project
- Example Jupyter notebook
- Build and test pip package on macOS
- Get build working on Linux
- Build and test pip package on Linux
- Run `go fmt` on the source
- Clean up Python smoke test
- Check that the Docker container can run IPTF-enabled TensorBoard
- Write up real README
- Write up build instructions
- Push upstream SWIG patch(es)
- Push upstream TensorFlow patch(es)
- Push upstream go-ipfs patch(es)
- `iptf` standalone command
  - `iptf serve [...]`
  - `iptf tensorboard ...`
  - `iptf python ...`
- Performance testing
  - IPFS read/write performance
  - IPFS network performance
  - No memory leaks under load