RFC: Modular TensorFlow #77
Conversation
Here are my proposed changes to streamline the introduction and correct a few typos.
Update 20190305-modular-tensorflow.md
> The ML ecosystem is constantly expanding. New hardware for accelerating ML applications is being worked on by many teams inside and outside Google. As the most popular machine learning framework, TF is expected to add support for as many of these hardware platforms as quickly as possible.
>
> Currently, this means that all such hardware developers need to check their code into the [main tensorflow repository](http://github.com/tensorflow/tensorflow), so all changes are required to go through the TF team's review. This can make merging support for new hardware very difficult.
... or maintain a fork of TF, right?
The Linux-kernel approach without GPL enforcement lands you in the Caffe situation; as the saying goes, “there are a thousand versions of Caffe in a thousand contributors’ eyes”.
Yes, here are two examples I am aware of:
https://github.com/ROCmSoftwarePlatform/tensorflow-upstream
https://github.com/codeplaysoftware/tensorflow
> 1. These cannot export any symbols into the global namespace, to avoid symbol collisions.
> 1. They need to communicate with TF through the provided C APIs.
> 1. They can link anything they like, any way they like, in addition to TF.
How would TF core APIs be called from within a plugin? E.g., if an MKL plugin signals that it can handle a usage, but as part of that usage the MKL plugin needs to call the Eigen implementation in TF core, how would that handoff be managed?
This is an interesting use case. I can think of two potential solutions:
1. The MKL plugin can use the global kernel registry to fetch the Eigen kernel and execute it.
2. The MKL plugin can rebuild all Eigen kernels and distribute them as well.
We can brainstorm more on this issue; I think it is certainly worth considering.
A global kernel registry sounds interesting. @agramesh1, @nhasabni do you see performance implications here?
We already have a global kernel registry in TF:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/op_kernel.cc#L1048
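For illustration, a minimal sketch of how a plugin could consult that registry through the existing `FindKernelDef` helper in `op_kernel.h` (the `NodeDef` construction here is simplified for the example):

```cpp
#include "tensorflow/core/framework/node_def.pb.h"
#include "tensorflow/core/framework/op_kernel.h"

// Sketch: locate the registered CPU kernel for MatMul via the global
// kernel registry, the same mechanism an MKL plugin could use to find
// the Eigen implementation living in TF core.
tensorflow::Status FindEigenMatMulKernel() {
  tensorflow::NodeDef node_def;
  node_def.set_name("example_matmul");
  node_def.set_op("MatMul");
  // MatMul kernels are constrained on the dtype attribute T.
  (*node_def.mutable_attr())["T"].set_type(tensorflow::DT_FLOAT);

  const tensorflow::KernelDef* kernel_def = nullptr;
  std::string kernel_class_name;
  // FindKernelDef consults the global registry linked above.
  return tensorflow::FindKernelDef(tensorflow::DeviceType("CPU"), node_def,
                                   &kernel_def, &kernel_class_name);
}
```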
> * Core TF functionality will be implemented in C++.
> * Core TF functionality can be extended using shared objects.
> * On top of the core C++ libraries, we will have the language bindings (using the C API).
As discussed in this developer thread, it is still not very clear to me how protobufs will be exposed by the C API.
For example, the language bindings will need to retrieve the list of available operations, which is currently represented by a list of `OpDef` described in `op_def.proto`. How will that information cross the C API?
- by returning the protobufs as a series of bytes that should be deserialized by the bindings, like in `TF_GetAllRegisteredKernels`?
- by exposing a snapshot of the protobufs as a text file that should be parsed by the bindings, like in `ops.pbtxt`?
- by wrapping protobuf messages in a C structure exposed by the API?
In other words, can a `.proto` file of TensorFlow be part of the contract with its modules, or should protobufs be used only internally by the core?
Our solution to this will be option 3: we would like to remove all protobufs from our APIs.
We will push for using protobufs only as serialization formats, for RPCs or data we store on disk.
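To make option 3 concrete, here is a purely hypothetical sketch of what wrapping op definitions in plain C structures could look like; `TF_OpDefinition`, `TF_NumRegisteredOps`, and `TF_GetRegisteredOp` are invented names for illustration, not proposed API:

```cpp
// Hypothetical sketch only: illustrates "wrap protobuf messages in C
// structs" at the API boundary. All names here are invented, not part
// of any actual or proposed TF C API.
extern "C" {

typedef struct TF_OpDefinition {
  const char* name;                // e.g. "MatMul"
  int num_inputs;                  // number of declared inputs
  const char* const* input_names;  // array of num_inputs input names
  int num_outputs;
  const char* const* output_names;
} TF_OpDefinition;

// A binding could then iterate registered ops without touching protobufs:
//   int n = TF_NumRegisteredOps();
//   for (int i = 0; i < n; ++i) {
//     const TF_OpDefinition* op = TF_GetRegisteredOp(i);
//   }
int TF_NumRegisteredOps(void);
const TF_OpDefinition* TF_GetRegisteredOp(int index);

}  // extern "C"
```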
@gunan It is not clear how to add new operators to a module plugin. Currently, we can create new operators through `OpDef` and optionally set their visibility to hidden. We have been doing this for the MKL-DNN quantization work.
The idea is to use a mechanism similar to custom ops, but simplified. With a C ABI, we will be able to make custom ops much easier to build and distribute.
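As a sketch of that direction, kernel registration through a C ABI could look roughly like the following, modeled on the `TF_KernelBuilder` interface in `tensorflow/c/kernels.h` (treat the exact signatures as assumptions):

```cpp
// Sketch of C-ABI custom-kernel registration, loosely modeled on
// tensorflow/c/kernels.h; exact signatures are assumptions.
#include "tensorflow/c/kernels.h"
#include "tensorflow/c/tf_status.h"

static void* MyKernel_Create(TF_OpKernelConstruction* ctx) {
  return nullptr;  // no per-kernel state in this sketch
}

static void MyKernel_Compute(void* kernel, TF_OpKernelContext* ctx) {
  // A real kernel would fetch inputs with TF_GetInput and emit outputs.
}

static void MyKernel_Delete(void* kernel) {}

void RegisterMyKernel() {
  TF_KernelBuilder* builder = TF_NewKernelBuilder(
      "MyOp", "CPU", &MyKernel_Create, &MyKernel_Compute, &MyKernel_Delete);
  TF_Status* status = TF_NewStatus();
  TF_RegisterKernelBuilder("MyOpKernel", builder, status);
  TF_DeleteStatus(status);
}
```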
Quick question: how will the TensorFlow libraries be published for use by language bindings?
Right now, pretty much all bindings have their own distribution (pip package, Maven artifacts, ...) that bundles both the client and the TF core libraries. But with modular TF, will the core libraries be built, tested, and published by Google so the clients just need to pick them up for distribution?
> **Modules:** These are the components of the core TF library that will "accept" plugins to expand their capabilities. Examples of modules are networking, filesystem, and (graph) optimizers.
>
> **Plugins:** Plugins are extensions to different modules. For example, the filesystem module can have a GCP plugin, an S3 plugin, and an HDFS plugin.
This is defining an SDK. When can we get a drop? That will generate lots of feedback/questions...
@sjamesr is working on this. We are trying to build the API based on the `OpKernel` and `OpKernelContext` classes.
A further clarification: the SDK will start as what exists today.
We will try to make all the kernel code work as is. Therefore, you can think of the current TF C++ headers as v0 of what is to come for the TF C++ APIs.
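For reference, a minimal kernel against today's C++ headers — effectively that v0 surface — looks like this (`IdentitySketch` is a hypothetical op name used to avoid colliding with real registrations):

```cpp
#include "tensorflow/core/framework/op_kernel.h"

// A minimal kernel written against today's C++ headers, i.e. the "v0"
// SDK surface described above.
class IdentitySketchOp : public tensorflow::OpKernel {
 public:
  explicit IdentitySketchOp(tensorflow::OpKernelConstruction* ctx)
      : OpKernel(ctx) {}

  void Compute(tensorflow::OpKernelContext* ctx) override {
    // Forward input 0 to output 0 unchanged.
    ctx->set_output(0, ctx->input(0));
  }
};

// Registration via the existing macro; under modular TF, plugins would
// do the equivalent through a stable C ABI instead.
REGISTER_KERNEL_BUILDER(
    Name("IdentitySketch").Device(tensorflow::DEVICE_CPU), IdentitySketchOp);
```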
> 1. These cannot export any symbols into the global namespace, to avoid symbol collisions.
> 1. They need to communicate with TF through the provided C APIs.
Thread-pool management? How do we coordinate between TF Core and plugins so as to not over/undersubscribe?
Our expectation for the moment is that threadpool management will be much as it is today. That is, plugins will typically use a threadpool provided to them (such as via `OpKernelContext`); the key difference would be that the interface would be provided as part of a stable ABI, so that it can be used across the boundaries of shared libraries. Thus, we don't expect that interactions will be substantially better or worse than today.
We cannot prevent the author of a plugin from creating a threadpool local to that plugin, but, as you are perhaps suggesting, doing so risks making interactions with other threadpools within the address space less predictable. That might well make such a plugin less desirable to clients.
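Concretely, a kernel today can shard work over the threadpool handed to it through `OpKernelContext` roughly like this (a sketch against existing interfaces):

```cpp
#include "tensorflow/core/framework/op_kernel.h"
#include "tensorflow/core/lib/core/threadpool.h"

// Sketch: shard work over the intra-op threadpool TF core hands to the
// kernel, instead of creating a plugin-local pool.
void ParallelSquare(tensorflow::OpKernelContext* ctx, float* data,
                    tensorflow::int64 n) {
  tensorflow::thread::ThreadPool* pool =
      ctx->device()->tensorflow_cpu_worker_threads()->workers;
  const tensorflow::int64 cost_per_unit = 10;  // rough cost per element
  pool->ParallelFor(n, cost_per_unit,
                    [data](tensorflow::int64 begin, tensorflow::int64 end) {
                      for (tensorflow::int64 i = begin; i < end; ++i) {
                        data[i] *= data[i];
                      }
                    });
}
```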
Following on from @claynerobison's comment: currently we use different default threadpool settings and also set some MKL-specific settings when MKL is enabled (see https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/common_runtime/threadpool_device.cc). Would there be a way to do this from the plugins?
> #### Plugins
>
> Plugins need to include implementations of the interfaces declared by one module. If the module interface requires `Init` and `Compute` methods, the plugin will need to implement these two functions, plus a `TF_InitPlugin` function which will be called at load time. This function will also need to register the plugin as prescribed by the module.
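A hypothetical skeleton of such a plugin; the `FooPlugin_*` names and the commented-out registration call are invented for illustration, since each module's RFC would define the real interface:

```cpp
// Hypothetical plugin skeleton. TF_RegisterFooPlugin and the FooPlugin_*
// names are invented; each module would prescribe the real interface.
extern "C" {

void FooPlugin_Init() {
  // One-time plugin setup, as required by the module interface.
}

void FooPlugin_Compute() {
  // The module-specific work the plugin performs.
}

// Called by TF core when the shared object is loaded.
void TF_InitPlugin() {
  // Register this plugin's implementation as prescribed by the module:
  // TF_RegisterFooPlugin(&FooPlugin_Init, &FooPlugin_Compute);  // invented
}

}  // extern "C"
```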
SDK/API architecture has the potential to influence the performance of plugins, for better or worse. If an API design favors one plugin and hurts another, what is the plan to resolve the conflict?
This is a great question. From our side, we will try to be as neutral as possible. Initial APIs will likely be based on the current C++ APIs TF has, but the API designs will be individually reviewed through different RFCs. We hope that we will be able to evaluate all concerns during these reviews. If we make mistakes favoring one plugin or another, we hope to alleviate them through the addition of new symbols, and through complete redesigns in the next major version releases.
> * Each package will set version boundaries for each of its dependencies.
> * Each package is responsible for ensuring that all of its public APIs keep working without any changes until the next major release.
> * Packages do not need to modify their minimum version requirements unless they start using newly introduced public API symbols.
> * TF metapackage releases may choose to hold back individual packages in favor of faster releases, but dependency requirements have to be respected when doing so.
What validation will the TF Core team do on plugins to ensure interoperability?
For a select number of plugins, we will run continuous integration tests. For Filesystems, these will be as simple as reading and writing files. For ops and kernels, we will run convergence and performance testing on select models.
Do you know what "select number of plugins" includes?
At the moment, "select number of plugins" will be the features currently officially supported by TF, such as CUDA, cloud file systems, and gRPC-based networking.
This list may grow or shrink over time.
> * tf-gpu: With GPU only, for different operating systems.
> * tf-estimator: Only for different Python versions.
>
> When testing a package that has dependencies, such as tf-estimator or tf-gpu, tensorflow-base will be installed at its latest stable release, to avoid any flakiness introduced by this package.
How will CI be done? When someone submits a PR to TF Core, can plugins trigger CI to ensure that the PR doesn't cause regressions in the plugin?
Each plugin shall manage the TF version it depends on by itself.
EDIT: replied to the wrong comment. Moved the previous comment down to the docs question below.
No, our goal is for each repository to run only its own unit tests for PRs. Currently, TF's situation is becoming unmanageable, and it is a big developer concern on the TF side.
Plugins will rely on TF core to keep its API syntactically and semantically compatible, and will then run their own tests.
If the integration tests run after PRs are merged uncover backwards-incompatible changes in TF or the other dependencies in the ecosystem, those changes will be rolled back.
> * tf-gpu: With GPU only, for different operating systems.
> * tf-estimator: Only for different Python versions.
>
> When testing a package that has dependencies, such as tf-estimator or tf-gpu, tensorflow-base will be installed at its latest stable release, to avoid any flakiness introduced by this package.
And vice versa: when someone submits a PR to a plugin, how does TF Core prevent regressions?
If such regressions can be detected with unit tests, those would be the detectors at presubmit time. If not, we will run these after merging changes. Most of our regression tests run after merging PRs right now.
> * tensorflow-base: Operating systems, compiler versions and Python versions, with CPU only.
> * tf-gpu: With GPU only, for different operating systems.
> * tf-estimator: Only for different Python versions.
Who will own publication of TensorFlow + MKL binaries?
- Anaconda packages
- PyPI packages
- container images
Describing just the current state (as official builds do not have MKL enabled yet): the TF team releases the main TF package, and Intel releases the tf-mkl plugin. Again, this reflects only the current state, and does not take into account any potential changes to MKL support in TF; we can evaluate those changes orthogonally. To clarify the 3 points you raised under the above scenario:
- Anaconda packages: the TF team does not own any Anaconda packages. We are separately evaluating Anaconda, so it is an orthogonal issue.
- PyPI packages: the `tensorflow` PyPI package is owned by the TF team, and `tensorflow-mkl` is a plugin owned by Intel.
- Container images: same as PyPI packages, but Intel has the option to extend the TF container images with just a `pip install tensorflow-mkl` command to build the MKL images on top of the TF images.
CC @agramesh1
> When this plugin is loaded by TF at runtime, the `TF_InitPlugin` method will be called. This method will register the plugin as prescribed by the module, and exit.
How will conflicts and priority be decided between plugins (e.g. CPU and GPU)?
We will try to avoid conflicts.
For ops, we have this rough idea: each op plugin is required to use a separate "namespace" assigned to it when creating new ops. But we will have to improve on this idea.
Clarification: if two plugins register the ability to handle the same op (or whatever granularity of operation), how will priority be decided between them?
In the current implementation of the kernel registry, you can add extra labels to each kernel being registered, and use those to select which kernels you would like to pick:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/kernel_def.proto#L35
Our initial idea is to leverage this, but @sjamesr may have other ideas.
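For example, with today's registration macros a labeled kernel could be added alongside the default one, and a graph node would then select it via its `_kernel` attribute (a sketch against existing interfaces; `MklMatMulSketchOp` is a placeholder class):

```cpp
#include "tensorflow/core/framework/op_kernel.h"

// Placeholder kernel class for illustration only.
class MklMatMulSketchOp : public tensorflow::OpKernel {
 public:
  explicit MklMatMulSketchOp(tensorflow::OpKernelConstruction* ctx)
      : OpKernel(ctx) {}
  void Compute(tensorflow::OpKernelContext* ctx) override {}
};

// Register under the label "mkl"; an unlabeled registration of the same
// op can coexist, and a node picks the labeled kernel by setting its
// "_kernel" attribute to "mkl".
REGISTER_KERNEL_BUILDER(
    Name("MatMul").Device(tensorflow::DEVICE_CPU).Label("mkl"),
    MklMatMulSketchOp);
```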
> #### Scenario 3: C and B depend on different minimum versions of A
>
> As both C and B have to define a range of versions they require from A, the max version should satisfy both constraints.
How do plugins catch these dependency chain conflicts with sufficient time to resolve them before a release cycle?
What kind of conflicts did you have in mind?
As described here, when we set the maximum required version to the next major release, plugins can release independently, and as each plugin is independently released, pip should be able to take care of satisfying the constraints for A. With the maximum required version of A the same for all plugins, the required version intervals of A are guaranteed to have an intersection. For example, if B requires A>=1.3,<2.0 and C requires A>=1.5,<2.0, any A in [1.5, 2.0) satisfies both.
> This is a pip issue. To help diagnose the problem, B and C should print that A is missing, and the user needs to install it to use B or C.
Who owns documentation for plugins? Is there a plan for a shared site with plugin contributions?
Each plugin resides in its own repository, and so does its doc site (separate from the TF core docs).
> 1. These cannot export any symbols into the global namespace, to avoid symbol collisions.
> 1. They need to communicate with TF through the provided C APIs.
> 1. They can link anything they like, any way they like, in addition to TF.
For MKL, right now we depend on many `cc_library` targets from core in Bazel build files. With a separate MKL plugin, would we be able to do the same? Or will the core build files somehow expose the libraries to depend on?
We are currently working to create a C API for all libraries accessible through the `OpKernel` and `OpKernelContext` classes. We hope that will cover all the libraries required by kernels. We can discuss whether MKL kernels need more symbols.
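For a flavor of that C surface, a plugin kernel's compute function could read its input entirely through C-ABI accessors, modeled on `tensorflow/c/kernels.h` (exact names here should be treated as assumptions):

```cpp
#include "tensorflow/c/kernels.h"
#include "tensorflow/c/tf_status.h"
#include "tensorflow/c/tf_tensor.h"

// Sketch: a plugin kernel touching its input only through C-ABI
// accessors, never through the C++ OpKernelContext directly.
static void SketchCompute(void* kernel, TF_OpKernelContext* ctx) {
  TF_Status* status = TF_NewStatus();
  TF_Tensor* input = nullptr;
  TF_GetInput(ctx, 0, &input, status);  // fetch input 0 across the C ABI
  if (TF_GetCode(status) == TF_OK) {
    void* data = TF_TensorData(input);  // raw buffer, usable by MKL etc.
    (void)data;
  }
  TF_DeleteTensor(input);
  TF_DeleteStatus(status);
}
```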
> #### Long build times
>
> Many volunteers and developers outside Google use their laptops for development. On such systems, TF development cycles require building all of TensorFlow, which takes around 2 hours. While there is an argument to be made that Bazel caching should ensure that they build all of TF only once, without Forge, Bazel caching does not work as [well as expected](https://docs.bazel.build/versions/master/remote-caching.html#known-issues).
With the new design, is using Bazel still a must for building the separate plugins (like MKL), or can other build tools (like CMake) be used?
Plugins that are not in the main tensorflow repository are free to use any build system they would like.
> 1. These cannot export any symbols into the global namespace, to avoid symbol collisions.
> 1. They need to communicate with TF through the provided C APIs.
> 1. They can link anything they like, any way they like, in addition to TF.
> 1. They can be built and distributed separately.
Can compilers other than GCC be used to build plugins (e.g., the Intel ICC compiler)?
Yes. But when using different compilers, it becomes very important to only use C symbols at the ABI level.
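For instance, a plugin built with ICC could keep all C++ internal and export only unmangled C entry points (a minimal sketch):

```cpp
// Keep C++ (and compiler-specific details) internal; expose only a C ABI.
namespace {
// Internal C++ implementation; its mangled symbols never leave the .so.
int AddImpl(int a, int b) { return a + b; }
}  // namespace

extern "C" {
// The only exported symbol: plain C types, no name mangling, no
// C++ runtime types crossing the shared-library boundary.
int PluginAdd(int a, int b) { return AddImpl(a, b); }
}
```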
> #### Flexibility for collaborators
>
> Currently, any partner or contributor who would like to work with us is subject to all the rules of the main repository. Some of these can be relaxed through modularization, where work can happen in a separate repository.
What do we plan to host, and where, for the split-up modular repos?
It will be up to the owners of each split repository.
They can be under the GitHub tensorflow org, they can be closed source, they can be on GitHub under a different org, or they can be hosted on different websites.
> More can be added to this list where we have to rebuild TensorFlow to support different network architectures, CUDA versions, SIMD instruction sets, etc.
>
> Having a monolithic repository means we need to rebuild all of our code for all of these different combinations. However, it makes no sense to rebuild all of our C++ code if the only difference is the Python version, or to rebuild all of our CPU kernels for different CUDA versions. Modularizing our code means we only need to rebuild and test the modules that are directly impacted by the dimensions we are changing in the support matrix.
Can you expand a little more on the PR process envisioned with this change?
Currently Intel pushes changes to Google, and Google owns the process of verifying:
- coding standards
- API compatibility
- backwards compatibility
- unit and regression tests
- others
Going forward, for PRs to the Intel MKL repo, will Intel police its own PRs into this separated module repo?
If Intel would like to own the MKL repository, yes, Intel will have full control over all of these. You will have to provide backwards compatibility on your public APIs.
Each repository owner will have full control over their repository.
For regression testing, we will work together to make sure the continuous integration of the base TF packages and all plugins continues to work.
> * Architecture (x86, ARM, ppc64, …)
> * Accelerator (CPU, GPU, TPU)
> * Compiler (GCC, Clang, MSVC)
> * Python version (2, 3.4, 3.5, 3.6, 3.7)
After the modularized approach (let's say for the Intel TF-MKL module):
- Who will be running the tests for this support matrix, or do we care only about the Linux environment?
- Who kicks off the CI process to test MKL, Intel or Google?
For the MKL module, assuming Intel takes full ownership of it (which we can discuss), both will be Intel's decision. If Intel only wants to support Linux, sure. If Intel wants to support all operating systems, sure.
TF and MKL presubmits will be independent, and their CIs will be owned by the repository owners. The integration tests will be a collaboration.
> * Compiler (GCC, Clang, MSVC)
> * Python version (2, 3.4, 3.5, 3.6, 3.7)
>
> More can be added to this list where we have to rebuild TensorFlow to support different network architectures, CUDA versions, SIMD instruction sets, etc.
How do we handle TF-MKL PRs from outsiders? What should Intel do?
That would be entirely Intel's call.
Intel may choose to completely close the source of the MKL plugin, or manage the plugin code like any other of its projects, for example: https://github.com/intel/mkl-dnn
> * Major releases still need to be coordinated.
>
> ## Packaging Issue Scenarios
How do we maintain release dependence?
Does Intel sync dot releases with Google's major releases?
If Google releases TF 3.0 with an MKL 2.17 dependency, is the next MKL release 2.18 (working with both TF 2.0 and TF 3.0), or is the tf-mkl release 2.18, or 3.1, or both?
I am not sure what you mean. The main TF package can either depend on MKL, or be depended on by MKL. We will not accept a circular dependency.
If TF depends on MKL, MKL will depend on tensorflow-base. If MKL chooses to depend on TF, it will work like an add-on.
What we would like to do, just as described here, is enable everyone to do independent releases. Therefore, avoiding circular dependencies is a must.
What is the status of this?
It is work in progress; the first module will be the filesystems (#101).
Is this mergeable or not?
It will be merged, but I have to edit some of the sections.
@gunan any way we could resolve this RFC?
Can we try to extend the deadline if needed? It is still set to 2019-04-04.
Relax API requirements.
Comment period is open until 2019-04-04
Modular TensorFlow
Objective: Propose an architecture where smaller, more focused TensorFlow modules can be created, maintained and released separately.