Introduce StableHLO evolution process #196
HLO/MHLO are supported by a wide variety of ML frameworks and compilers, including IREE, JAX, ONNX, PyTorch, TensorFlow and XLA. With StableHLO, we are aiming to build on this success and create an amazing portability layer between ML frameworks and ML compilers. To that end, we are establishing the StableHLO evolution process to provide StableHLO users with well-defined means of following and influencing the evolution of the StableHLO opset and accompanying infrastructure (the CHLO opset, bytecode serialization/deserialization, etc).
Having participated in similar discussions and efforts in LLVM, I have a few general comments that may be helpful. Overall, it sounds like any other project process, but it may need a rewrite if you want this to be an upstream process. Some of the artificial constraints you set don't connect well with the reality of our industry if you want to drive a community project.
Inside companies, where the hierarchy, time frames and goals are aligned, you can make people focus on your proposals. Upstream, not only can you not make people do anything, but people may be on holiday, or on critical paths for their own products, or going through personal problems, which makes it hard for them to participate. Initially this won't be a problem here, given that the governance and most people are within Google. But if you retain such tight control of the proposal process, you'll find that no one outside of Google (other than those with a strong vested interest) will pay attention to your project, and this will become MHLO++ without community efforts.
First, setting 15 days after an RFC is asking for trouble. What if a member in China puts up an RFC just before the Christmas holidays? Or the other way round, a US member does that on the Chinese New Year? People have tried to set reasonable time boundaries in LLVM and they all have some problem, with recent examples on the community side of it (mailing list, Discourse, etc.) being a total disaster.
Second, it sounds from this document that "request for change" can only happen at the end of this grace period and it's up to the "process lead" to request them. If you want a vibrant community, you'll need to include them here as the main drivers, not bystanders. Reviews should be asked for and updates should be performed during the period, by everyone in the community. An active review converges much faster, and attracts a lot more people.
Third, it seems that the "process lead" can accept and reject (in which case the RFC is closed), without or despite reaching consensus. Sure, you can add that to the rules, but that means my will to participate has gone very close to zero. If the LLVM community is any indication of an "average compiler person", then I bet most people will feel like I do.
Finally, you tie the proposal text to the proposal prototype/implementation. These have very different review styles and lifetimes. If you bind them in one review, bike-shedding in the doc can keep a perfectly good technical implementation from ever merging, or implementation details can prevent proposals from being accepted, wasting the time of the wrong crowd and making it likely that the proposal is refused just to stop that.
Hey Renato - can I ask a meta-question: do you have any ideas on how to have this kind of project governance discussion (in a way that has worked) from a standing start? I ask because that is the situation we are in and we are looking for feedback there. It feels a little bit like we are trying to build a wing here without the whole airplane that it connects to. A couple of opinions:
At the OpenXLA level, we are likely going to be introducing an interim governance model meant to hold us for the moment and help set bounds on conversations like this one. What we are actually aiming for is a bit TBD, but I personally like the hierarchical model of PyTorch's new governance and we are working on gaming something like that out a bit more (I like it specifically because it has a sense of locality for disagreements and defined processes for leadership and all of the things that can go wrong with people and power). My 2 cents.
So, I'm not particularly fond of the way we do things in LLVM for the reasons you outlined, but it's not easy to make it better without some sort of stronger rules, and LLVM doesn't do well with stronger rules because no one wants the other side to be stronger... sigh. There are good examples that work (for some definition) but have their own issues. FreeBSD, Debian and Linux have a strongly encoded model; that works for the people involved in those projects, and the people that surround them don't seem to find them too hard to grasp (I do). GCC and LLVM are hazier on the details and people often get confused. To me, it's more a matter of "how many internal users you have" than how large your community is. LLVM and GCC have a very diverse set of users (downstream, upstream, rebuilders, academic, etc.) and it's hard to accommodate everyone. Linux, despite probably having the largest user base in the world, is more monolithic in usage, but it also has rules that are hard to grasp. I imagine PyTorch and OpenXLA would have a much simpler set of rules, considerably fewer internal users (API, ABI) and more external users (like the Linux kernel), so the rules should be easier to encode and understand. To that effect, I'd only recommend minor changes to the wording to make it less "every-day top-down" while still having a hard stop when needed. For example:
How we handled proposals in LLVM (RFC + document in a folder) could be improved by having a second folder for "approved proposals", even if they're not being actively worked on right now, so that people can work towards those goals even if there isn't a concrete implementation, knowing that eventually there will be one. In the end, if you want a community-driven project you have to give power to the community. Otherwise, it's just another
It's a bit like the Khronos Group versus the Linux Foundation. Neither model is perfect, but the former is tailored towards "closed" open standards while the latter tries to do "open" open standards. Both work for what they are meant to do; you just have to pick your side and stick with it.
Side note: The original text may have been written with the intent of the changes I proposed, but it's not clear and it does not send that message to me, in particular. It may be just a matter of rewriting it to be clearer, rather than changing the original intentions.
Renato, thanks a lot for your feedback! With StableHLO (and, more generally, OpenXLA), we're growing new communication channels, and it is so nice that you're taking the time and effort to help establish them. "Side note: The original text may have been written with the intent of the changes I proposed". Thank you for saying this! To the best of my understanding of your proposal, the intent of this PR is indeed pretty close to what you have in mind, and your comment provides a great way to build off of that. What I aimed to achieve with this PR is to propose a process which encourages community consensus but also provides additional structure to resolve the hopefully unlikely situations where consensus cannot be reached. However, I can totally see how the current version of the PR can be read as "not much can happen without the involvement of the process lead, and they can do whatever they want, including overriding community consensus". The PR needs some updates to make sure it better reflects the intent. I also agree with other specific details of your feedback, e.g. that the 14-day timer is too rigid, and that it would be better to separate RFCs from implementations. Let me spend some time wordsmithing, and I'll push an update to this PR to reflect this.
As the *HLO roots are in the TF repository, I really hope that we can also take into account what worked and what did not over years of RFCs in https://github.com/tensorflow/community. From the public data on GitHub, one of the issues with that process is that statistically we had very few RFCs by independent contributors and few by "enterprise" external contributors (e.g. hardware vendors). My perception was that the barrier to submitting an RFC was too high, and it was often hard to attract enough attention even when just commenting on existing RFCs. More generally, at some point we also had a sort of disconnect between proposals and implementations. As this is a brand new effort, I hope that we can improve on this, especially if we are still interested in independent contributors, as they often have fewer resources to spend on overly complex formalizations than "enterprise partners". Short of preparing a formal RFC, I would really like us to have a sort of sandbox space where we could quickly check the compositional limits of the current opset for a proposed computation.
As much as I'm in favor of doing better than TensorFlow, the thesis here is not correct at all: these are two completely different projects that happened to be commingled at the repository level. There is virtually nothing about what constitutes an op in TensorFlow that will apply to StableHLO. StableHLO is ultimately a compiler frontend opset with concessions made for stability and interop over time. It needs principles much closer to how LLVM operates in terms of design, and, I believe, an evolution process that has some more formality in terms of ownership and dispute resolution. I think the opportunity we have here is to offer a bit more written guidance about those principles and processes than LLVM does, which takes a much more "unwritten common law" approach that requires a deep understanding of the totality of the history of the project in order to interpret how a decision gets made. The goal is ultimately to ensure that StableHLO has operations sufficient to build compilers that can map all in-scope framework operations to all in-scope hardware implementations. I'm purposefully leaving some of those terms underdefined here because I think they point at us writing down more of the principles (or accumulating more "case law"). Contributions will of course be accepted from all, but the bar is intentionally high and arbitrated by the community of people who have developed the standing to evaluate the design. I think that an evolution process such as this needs to primarily define how one develops that standing, the day-to-day processes for making changes, what the dispute resolution hierarchy is, and links back to the design principles that will be applied (which are themselves codified and extended over time via the evolution process, but with an even higher bar).
History chose that placement. Whether, in its origins, it ever used the RFC process defined for the whole repository is another matter.
I am not saying we should lower the standards for the ops to include here, but in the end we need to be able to express tensor computations, don't we? When I was reasoning, e.g. in a classical bridge kernel, about how to use/drive the HLO API ops to express my computation, I think it would in the end be quite similar to how to express it with StableHLO. So I wish you could find a space in this project to support or dispel doubts about computations that users think are hard to express. The alternative, or probably what is classically expected, is to explore this path top-down: starting from the frameworks, passing through the bridges, then going to (Open)XLA and eventually pinging StableHLO. But very often this process already stops at the framework level from an end-user point of view. Confirming that the computation is feasible at the StableHLO operator level could help to quickly shift the focus, and possibly the contribution, to OpenXLA or to the bridges with a bottom-up approach. Maybe we don't have the resources here to do this, but I think it is still something interesting to explore.
My last point is also partially related to: Especially referring to MHLO:
I don't think it is strictly in the exclusive ownership of XLA or StableHLO, but for sure OpenXLA is involved.
I don't know what to tell you: there are a lot of ways that StableHLO needs to evolve, based both on things that we need now and things we will need in the future. The governance being defined here is about how to manage that evolution, not any specific thing. I expect that both the frameworks above and the compiler/hardware makers below will need to ensure that they have sufficient representation and influence in OpenXLA overall and StableHLO specifically in order to arbitrate that over time. Since the goal of the OpenXLA project itself is to bridge those worlds, I expect that the incentives and goals are aligned so that this can work. I'm having trouble following exactly what you are suggesting, but I think you want to see some kind of different "ownership" structure that apportions influence in some kind of top-down fashion across different parts of the ecosystem. I don't think we're going to solve those things with governance or ownership structure (the "how"). Some of what you are bringing up is more appropriate in defining project design principles (the "why") and specific projects that need to be done to move things forward (the "what"). Those last two will be defined over time by the community of contributors in accordance with the overall project goals and principles.
Not exactly, it was partially connected to the "exclusive" RFC nature of the evolution. It was quite "exclusive" to propose an RFC and collect enough sponsorship/attention in the root repo, even if the XLA/HLO subproject has probably never used it in these years (so we don't have datapoints for these specific components). But as we have discussed in this thread, the high(er) standards here are not going to lower this barrier. So my point is: can we also have a more informal pre-evaluation of evolution, or a ticketing process that is more accessible to end users, without going down the full stack from the "consumer" frameworks to the bridges and then to (Open)XLA? In the end, if I find it hard to compose my computation with *HLO but I don't have all the formalism to prepare an RFC in the repo, because it is not my daily activity, I really want to have a space/process to discuss or eventually deconstruct evolution needs for StableHLO without:
I think a pseudo-computation could already be formulated in terms of the available StableHLO ops. At least it was possible with the classical HLO opset/API. I hope I have said something that makes sense.
I don't think that is connected. Discussions, forums and other manners of collaboration are disconnected from the process of actually making consequential changes (i.e. an RFC). In fact, such discussions, IME, are essential to properly getting to the bottom of defining something well (prior to an RFC or general consensus that changes are not consequential enough to need one). Nothing in the evolution doc precludes such discussions or even more formal things like office hours, standing meetings on topics of interest, etc. One of the biggest mistakes I see in a lot of projects is people thinking that they just need to "go it alone" on consequential RFCs and make a one-shot submission of something on their own. Many of the most successful project evolutions I have seen do the opposite (and that is outside of the scope of the process): a group of people with an aligned interest find each other and have enough discussion to decide to make something happen. Once it is baked enough, they raise an RFC with the project(s). Since familiarity breeds consensus, these kinds of exercises often produce the best results. All of that falls into the bucket of how you bootstrap good community norms. This document is just about the mechanics of actually proposing and making changes (and dealing with some of the issues that happen when humans interact on such things).
Nothing precludes it, but I want to have something about this in the evolution process itself. We could include a statement that we welcome, in the pre-evolution formalization stage, direct discussion of tensor computations that end users find hard to express. This is just my proposal; if you prefer a purely top-down approach, that could still be OK, but my point of view is that end users generally get lost along the top-down path.
I'd prefer to keep this actual process definition focused on the mechanics. It was mentioned above somewhere, but links to more open-ended norms and principles docs would be good to include at the top. If we had those, they could easily be referenced in an intro paragraph about what the scope of this doc is and isn't.
## Scope

This process governs the evolution of StableHLO and CHLO opsets and bytecode,
I would qualify it as "minimally applying to". I would also note that general consensus in a design discussion or code review can conclude that any topic requires an RFC, at which point, this process applies.
While rarer, general consensus can also conclude that something which may technically appear to require an RFC is not of consequence and should just be dealt with by normal code review processes. (IME, it is pretty common for newer community members to feel that what they are doing rises to the bar of requiring an RFC, while the more seasoned members see that it is trivial or in kind with something already decided.)
Ok I think that something like this could be good enough.
OK, we could continue this topic at openxla/xla#17. I suppose that StableHLO members could also comment there.
Closing this pull request in favor of the upcoming discussion of the more general OpenXLA governance process. As shared by @theadactyl at the 11/15 community meeting, we're working on formalizing OpenXLA governance, and this pull request and the discussion around it (thank you, everyone, for your contributions!) will feed into that.