Introduce StableHLO evolution process #196

@burmako burmako commented Sep 27, 2022

HLO/MHLO are supported by a wide variety of ML frameworks and compilers, including IREE, JAX, ONNX, PyTorch, TensorFlow and XLA. With StableHLO, we are aiming to build on this success and create an amazing portability layer between ML frameworks and ML compilers.

To that end, we are establishing the StableHLO evolution process to provide StableHLO users with well-defined means of following and influencing the evolution of the StableHLO opset and accompanying infrastructure (the CHLO opset, bytecode serialization/deserialization, etc).

@burmako burmako added the RFC label Sep 27, 2022
@burmako burmako self-assigned this Sep 27, 2022
@rengolin

Having participated in similar discussions and efforts in LLVM, I have a few general comments that may be helpful.

Overall, it sounds like any other project process, but it may need a rewrite if you want this to be an upstream process.

Some of the artificial constraints you set don't connect well with the reality of our industry if you want to drive a community project. Inside companies, where the hierarchy, time frames and goals are aligned, you can make people focus on your proposals. Upstream, not only can you not make people do anything, but people may be on holidays, or in critical paths on their own products, or going through some personal problems, which makes it hard for them to participate.

Initially this won't be a problem here, given that the governance and most people are within Google. But if you retain such tight control of the proposal process, you'll find that no one else outside of Google (or those with strong vested interest) will pay attention to your project, and this will become MHLO++ without community efforts.

First, setting 15 days after an RFC is asking for trouble. What if a member in China puts up an RFC just before the Christmas holidays? Or the other way round, a US member does that on the Chinese New Year? People have tried to set reasonable time boundaries in LLVM and they all have some problem, with recent examples on the community side of it (mailing list, Discourse, etc) being a total disaster.

Second, it sounds from this document that "request for change" can only happen at the end of this grace period and it's up to the "process lead" to request them. If you want a vibrant community, you'll need to include them here as the main drivers, not bystanders. Reviews should be asked and updates should be performed during the period, by everyone in the community. An active review converges much faster, and attracts a lot more people.

Third, it seems that the "process lead" can accept and reject (in which case the RFC is closed), without or despite reaching consensus. Sure you can add that to the rules, but that means my will to participate has gone very close to zero. If the LLVM community is any indication of an "average compiler person", then I bet most people will feel like I do.

Finally, you tie the proposal text with the proposal prototype/implementation. These have very different review styles and lifetimes. If you bind them in one review, bike-shedding in the doc can keep a perfectly good technical implementation from ever merging, or implementation details can prevent proposals from being accepted, wasting the time of the wrong crowd and making it likely that the proposal will be refused just to stop that.

@stellaraccident

Hey Renato - can I ask a meta-question: do you have any ideas (that have worked) on how to have this kind of project governance discussion from a standing start? I ask because that is the situation we are in and are looking for feedback there. It feels a little bit like we are trying to build a wing here without the whole airplane that it connects to.

A couple of opinions:

  • Outside of an actual standards process, highly prescribed roles/rules for contributions are probably not going to work (noting your objection to a firm time period, etc), and we have to be in a mode where consensus of the stakeholders is real and happening mostly by default as a normal mode of operation. To extend your example, if the right contributor is out for Chinese New Year, we wait (and to the extent possible plan for that). If that is an extended leave, then some arrangement has to be made. Most real cases will fall somewhere in the middle.
  • I will say that with a fair amount of LLVM history, I can tell you that the "old hats" understand the flow but newcomers are often put off by not really understanding how decisions get made. I think there is room to document norms for such things (vs hard and fast rules) so that people have a better feeling about how the project should be running in non-exceptional situations.
  • I'm keenly interested in what happens when the norms don't work, decisions get made in error, consensus can't emerge etc -- the failure cases. Personally, for a project like this, I'm in favor of models where there is a hierarchy of escalation for such things and everyone who has developed standing in the ecosystem has a shot at being an official part of such resolution processes (i.e. if there is a "project lead", that job is actually open to anyone who has developed standing -- for real and not just on paper).

At the openxla level, we are likely going to be introducing an interim governance model meant to hold us for the moment and help set bounds on conversations like this one. What we are actually aiming for is a bit TBD, but I personally like the hierarchical model of PyTorch's new governance and we are working on gaming something like that out a bit more (I like it specifically because it has a sense of locality for disagreements and defined processes for leadership and all of the things that can go wrong with people and power).

My 2 cents.

@rengolin

So, I'm not particularly fond of the way we do things in LLVM for the reasons you outlined, but it's not easy to make it better without some sort of stronger rules and LLVM doesn't do well with stronger rules because no one wants the other side to be stronger... sigh.

There are good examples that work (for some definition) but have their own issues. FreeBSD, Debian, Linux have a strongly encoded model and that works for the people involved in those projects and people that surround them don't seem to find them too hard to grasp (I do). GCC and LLVM are more hazy on the details and people often get confused. To me, it's more a matter of "how many internal users you got" than how large your community is.

LLVM and GCC have a very different set of users (downstream, upstream, rebuilders, academic, etc) and it's hard to accommodate everyone. Linux, despite probably having the largest user base in the world, is more monolithic in usage, but it also has rules that are hard to grasp. I imagine PyTorch and OpenXLA would have a much simpler set of rules, considerably fewer internal users (API, ABI) and more external users (like the Linux kernel), so they would be easier to encode and understand.

To that effect, I'd only recommend minor changes to the wording to make it less "every-day top-down" but still having a hard stop when needed. For example:

  • Change the RFC cycle to something similar to a release schedule, with "release candidates" worked on, then waiting periods, cycling back if problems are raised, or "releasing" (i.e. accepting the proposal) if not. This gives a nice converging nature without being hard on times (and no need to add exceptions for each national holiday).
  • Make the role of the "process lead" to be a "last resort", not a mandated step. If the community reaches consensus on its own, there's no need for a lead making decisions. If the community doesn't, then the selected lead has the power to sway things in the direction of the project's charter.
  • Split the RFC process into proposal and implementation, or at least allow one to be merged without the other. Putting it another way, make sure a bad reference implementation doesn't hold a perfectly valid proposal from being worked on. An example of this is the bike-shedding we're having on where to put the TCP dialect and the contention it's having on actually having such a dialect.

How we did it in LLVM for proposals (RFC + document in a folder) could be improved by having a second folder for "approved proposals" even if they're not being actively worked on right now, so that people can work towards those goals even if there isn't a concrete implementation, knowing that eventually, there will be one.

In the end, if you want a community-driven project you have to give power to the community. Otherwise, it's just another <Company-name> project on Github. You may have collaborators with the same agenda, but that's not a community.

It's a bit like Khronos group versus Linux Foundation. Neither model is perfect, but the former is tailored towards "closed" open standards while the latter tries to do "open" open standards. Both work for what they are, you just have to pick your side and stick with it.

@rengolin

Side note: The original text may have been written with the intent of the changes I proposed, but it's not clear and it does not send that message to me, in particular. It may be just a matter of rewriting it to be clearer, rather than changing the original intentions.

@burmako
Contributor Author

burmako commented Sep 27, 2022

Renato, thanks a lot for your feedback! With StableHLO (and, more generally, OpenXLA), we're growing new communication channels, and it is so nice that you're taking the time and effort to help establish them.

"Side note: The original text may have been written with the intent of the changes I proposed". Thank you for saying this! To the best of my understanding of your proposal, the intent of this PR is indeed pretty close to what you have in mind, and your comment provides a great way to build off of that.

What I aimed to achieve with this PR is to propose a process which encourages community consensus but also provides additional structure to resolve the hopefully unlikely situations where the consensus cannot be reached. However, I can totally see how the current version of the PR can be read as "not much can happen without the involvement from the process lead, and they can do whatever they want including overriding community consensus". The PR needs some updates to make sure it better reflects the intent.

I also agree with other specific details of your feedback, e.g. that the 14-day timer is too rigid, and that it would be better to separate RFCs from implementations. Let me spend some time wordsmithing, and I'll push an update to this PR to reflect this.

@bhack

bhack commented Oct 15, 2022

As the *HLO roots are in the TF repository I really hope that we could also take care of what worked and what not with years of RFCs in https://github.com/tensorflow/community.

From the public data on GitHub, one of the issues with that process is that, statistically, we had very few RFCs by independent contributors and few by "enterprise" external contributors (e.g. hw vendors etc.)

My perception was that the barrier to submitting an RFC was too high, and it was often hard to attract enough attention even when just commenting on existing RFCs.

More generally, at some point we also had a sort of disconnect between proposals and implementations.

As this is a brand new effort, I hope that we can improve on this, especially if we are still interested in independent contributors, as they often have fewer resources to spend on complex formalizations than "enterprise partners".

Short of preparing a formal RFC, I would really like to have a sort of sandbox space where we could quickly check the compositional limits of the current op set with respect to a proposed computation.
I don't know if this could be done on a voluntary basis in a specific forum section or somewhere else, but it would be very helpful.
When a user starts thinking about a specific computation and looks at the ops available in StableHLO at that point in time, it would be nice to have this sandbox in case the computation seems unreachable.
So before proposing any extension with an RFC and all the related overhead, sometimes I just want to check whether something is missing to express a specific computation (I mean other than relying on while loop ops, which at least historically were the most general but also one of the slowest solutions in HLO - correct me if I am wrong).

@stellaraccident

stellaraccident commented Oct 15, 2022

As the *HLO roots are in the TF repository I really hope that we could also take care of what worked and what not with years of RFCs in https://github.com/tensorflow/community.

As much as I'm in favor of doing better than Tensorflow, the thesis here is not correct at all: these are two completely different projects that happened to be commingled at the repository level. There is virtually nothing about what constitutes an op in Tensorflow that will apply to StableHLO.

StableHLO is ultimately a compiler frontend opset with concessions made for stability and interop over time. It needs principles much closer to how LLVM operates in terms of design, and I believe, an evolution process that has some more formality in terms of ownership and dispute resolution. I think the opportunity we have here is to offer a bit more written guidance about those principles and processes vs how LLVM operates -- which takes a much more "unwritten common law" approach that requires a deep understanding of the totality of the history of the project in order to interpret how a decision gets made.

The goal is ultimately to ensure that StableHLO has operations sufficient to build compilers that can map all in scope framework operations to all in scope hardware implementations. I'm purposefully leaving some of those terms under defined here because I think they point at us writing down more of the principles (or accumulating more "case law").

Contributions will of course be accepted by all but the bar is intentionally high and arbitrated by the community of people who have developed the standing to evaluate the design. I think that such an evolution process as this needs to primarily define how one develops that standing, the day to day processes for making changes, what the dispute resolution hierarchy is, and links back to the design principles that will be applied (which are themselves codified and extended over time via the evolution process but with an even higher bar).

@bhack

bhack commented Oct 15, 2022

these are two completely different projects that happened to be commingled at the repository level.

History chose that placement. Whether it ever actually used the RFC process defined for the whole repository is another matter.

There is virtually nothing about what constitutes an op in Tensorflow that will apply to StableHLO

I am not saying we should lower the standards for the ops to include here, but in the end, do we need to be able to express tensor computations or not?

As I was reasoning, e.g. in a classical bridge kernel, about how to use/pilot the HLO API ops to express my computation, in the end I think it would be quite similar to how I would express it with StableHLO.

So I wish you could find a space in this project to support or dissolve doubts about what users think are hard-to-express computations.

The alternative, or probably what is classically expected, is to explore this path top-down: starting from the frameworks, passing through the bridges, then going to (Open) XLA and eventually pinging StableHLO.

But very often this process already stops at the framework level from an end-user point of view.

Confirming that the computation is feasible at the StableHLO operators level could help to quickly shift the focus and possibly the contribution to OpenXLA or to the bridges with a bottom up approach.

Then maybe we don't have the resources here to do this but I think it is still something interesting to explore.

@bhack

bhack commented Oct 16, 2022

My last point is also partially related to:
openxla/xla#17 (comment)

Especially referring to MHLO:

Unfortunately, sometimes these operations (and compositions of these operations) are not sufficiently expressive

I don't think it is strictly in the exclusive ownership of XLA or StableHLO but for sure OpenXLA is involved.

@stellaraccident

I don't think it is strictly in the exclusive ownership of XLA or StableHLO but for sure OpenXLA is involved.

I don't know what to tell you: there are a lot of ways that StableHLO needs to evolve, based both on things that we need now and things we will need in the future. The governance being defined here is about how to manage that evolution, not any specific thing. I expect that both the frameworks above and the compiler/hardware makers below will need to ensure that they have sufficient representation and influence in OpenXLA overall and StableHLO specifically in order to arbitrate that over time. Since the goal of the OpenXLA project itself is to bridge those worlds, I expect that the incentives and goals are aligned so that this can work.

I'm having trouble following exactly what you are suggesting but I think you want to see some kind of different "ownership" structure that apportions influence in some kind of top down fashion across different parts of the ecosystem. I don't think we're going to solve those things with governance or ownership structure (the "how"). Some of what you are bringing up is more appropriate in defining project design principles (the "why") and specific projects that need to be done to move things forward (the "what"). Those last two will be defined over time by the community of contributors in accordance with the overall project goals and principles.

@bhack

bhack commented Oct 16, 2022

I'm having trouble following exactly what you are suggesting but I think you want to see some kind of different "ownership" structure

Not exactly, it was partially connected to the "exclusive" RFC nature of the evolution.

It was quite "exclusive" to propose an RFC and collect enough sponsorship/attention in the root repo, even if the XLA/HLO subproject has probably never used it in these years (so we don't have specific datapoints for these specific components).

But as we have discussed in this thread, the high(er) standards here are not going to lower this barrier.

So my point is:

Can we also have a more informal pre-evaluation of evolution, or a ticketing process more accessible to end users, without going down the full stack from the "consumer" frameworks down to the bridges, down to (Open)XLA?

In the end, if I find it hard to compose my computation with *HLO but I don't have all the formalism to prepare an RFC in the repo, because it is not my daily activity, I really want to have a space/process to discuss or to eventually deconstruct evolution needs for StableHLO without:

  • Going down the stack through the trouble of traversing 2/3 GitHub repos

  • Taking care of the overhead of a formal RFC

I think a pseudo-computation could already be formulated in terms of the available StableHLO ops. At least it was possible with the classical HLO opset/API.

I hope I have said something that makes sense

@stellaraccident

Not exactly, it was partially connected to the "exclusive" RFC nature of the evolution.

I don't think that is connected. Discussions, forums and other manner of collaboration are disconnected from the process of actually making consequential changes (i.e. an RFC). In fact, such discussions, in my experience, are essential to properly getting to the bottom of defining something well (prior to an RFC or general consensus that changes are not consequential enough to need one).

Nothing in the evolution doc precludes such discussions or even more formal things like office hours, standing meetings on topics of interest, etc. One of the biggest mistakes I see in a lot of projects is people thinking that they just need to "go it alone" on consequential RFCs and make a one shot submission of something on their own. Many of the most successful project evolutions I have seen do the opposite (and that is outside of the scope of the process): a group of people with an aligned interest find each other and have enough discussion to decide to make something happen. Once it is baked enough, they raise an RFC with the project(s). Since familiarity breeds consensus, these kinds of exercises often produce the best results.

All of that falls into the bucket of how you bootstrap good community norms. This document is just about the mechanics of actually proposing and making changes (and dealing with some of the issues that happen when humans interact on such things).

@bhack

bhack commented Oct 16, 2022

Nothing in the evolution doc precludes such discussions or even more formal things like office hours, standing meetings on topics of interest, etc.

Nothing precludes but I want to have something about this in the evolution process itself.
It is why I am commenting in this PR.

We could have a reference stating that we welcome, in the pre-evolution formalization, direct discussion of end users' hard-to-express tensor computations.

This is just my proposal. If you prefer a purely top-down driven approach, that could still be ok, but my point of view is that end users are generally lost in the top-down path.

@stellaraccident

stellaraccident commented Oct 16, 2022

I'd prefer to keep this actual process definition focused on the mechanics. It was mentioned above somewhere but links to more open ended norms and principles docs would be good to include at the top. If we had those, they would be easily referenced in an intro paragraph about what the scope of this doc is and isn't.


## Scope

This process governs the evolution of StableHLO and CHLO opsets and bytecode,

@stellaraccident stellaraccident Oct 16, 2022


I would qualify it as "minimally applying to". I would also note that general consensus in a design discussion or code review can conclude that any topic requires an RFC, at which point, this process applies.

While rarer, general consensus can also conclude that something which may technically appear to require an RFC is not of consequence and should just be dealt with by normal code review processes. (Ime, it is pretty common for newer community members to feel that what they are doing rises to the bar of requiring an RFC but the more seasoned members see it is trivial or in kind with something already decided)


Ok I think that something like this could be good enough.

@bhack

bhack commented Oct 16, 2022

I'd prefer to keep this actual process definition focused on the mechanics. It was mentioned above somewhere but links to more open ended norms and principles docs would be good to include at the top. If we had those, they would be easily referenced in an intro paragraph about what the scope of this doc is and isn't.

Ok we could continue this topic at openxla/xla#17

I suppose that also StableHLO members could comment there.

@burmako burmako mentioned this pull request Oct 19, 2022
@burmako burmako mentioned this pull request Nov 5, 2022
@burmako
Contributor Author

burmako commented Nov 24, 2022

Closing this pull request in favor of the upcoming discussion of the more general OpenXLA governance process. As shared by @theadactyl at the 11/15 community meeting, we're working on formalizing OpenXLA governance, and this pull request and the discussion around it (thank you, everyone, for your contributions!) will feed into that.

@burmako burmako closed this Nov 24, 2022
@burmako burmako deleted the evolution branch November 24, 2022 23:06