Functions backwards compatibility #274

westonpace · 2022-08-03T00:53:23Z

Background:

A few community syncs back we discussed backwards compatibility for functions (e.g. yaml elements) at some length. I'd like to capture some of the key thoughts here:

Adding an option is unlikely to be backwards compatible, unless possibly if added at the very end, but given that most options are likely to be optional/enum, and those tend to be placed before value arguments, it seems unlikely.
Changes to properties (e.g. nullability) would not be backwards compatible
Adding new properties could be backwards compatible, provided the new property matched implicit assumptions that already existed.
Adding a kernel should be backwards compatible
Changes to the description could, in theory be backwards incompatible, even though no "technical content" changed.
We shouldn't try too hard to maintain backwards compatibility, just create a new function name at that point

We also discussed potentially versioning the URIs for functions:

Adding the version in the URI (or basing the URI on a commit hash) could allow for breaking changes but would but introduces the risk of artificial breaking changes. For example, if we bump the version every release then it will look like all functions had a potentially backwards incompatible change / name change even if they didn't. More significantly, plans written against v2 would not work against v3.
Removing the version from the URI means that any backwards incompatible change to a function would likely require a new name.

Finally, there was discussion about marking a function as "stable" / "sealed":

Before becoming stable, a function is liable to have backwards incompatible changes, without requiring a name change.
Once a function is stable, if any backwards incompatible change needs to be made, it would require a name change.
The stability could be implicit or it could be an explicit property of either the YAML or the name.

Proposal:

I imagine we will continue to learn things for a while (e.g. rounding modes) and we shouldn't consider any functions stable until at least 1.0, etc.
I think functions should be explicitly versioned. Version 0 should be implicitly unstable. Version 1 should generally be created as long as there are 2 consumers and 2 producers that support it.
Generally, higher version numbers are going to be "more precise behavior" and not "different behavior". For example, an "add v3" shouldn't be a variation on add but should replace "add v2" with more precise information.
Producers should be able to specify a "minimum version" by asking for the older version. For example, let's pretend we added rounding modes to floating arithmetic in version 2. A producer could have a user facing option that defaults to "unspecified rounding mode" and, if the user chooses that, then version 1 is acceptable. The producer would then supply the arguments for version 1.
This means that consumers should generally implement all versions of a function, and not just the latest versions. Typically this will only need extra mapping, and not actual separate implementations, although that may be required sometimes.
Versions should not be included in the URI but should either be a part of the name (add-2) or an independent property (would require changes to the protobuf to specify which version of the function is being executed). I'm leaning slightly in favor of using the name at the moment.

jvanstraten · 2022-08-15T10:19:09Z

Adding an option is unlikely to be backwards compatible, unless possibly if added at the very end

Nit: this is actually not the case currently, because the function name in the plan must include "opt" for each optional enum argument, and optional arguments actually have to be explicitly set to "unspecified" in the plan. (I think I didn't understand this yet back when we had this sync meeting)

For your proposal, it sounds a bit like you're reinventing semver. Why not just go the extra mile and use x.y.z versions for each function?

westonpace · 2022-08-15T17:42:11Z

For your proposal, it sounds a bit like you're reinventing semver. Why not just go the extra mile and use x.y.z versions for each function?

Can you give an example of what would constitute a minor change or a patch change? I think I was basically arguing that there is no way to do either so we shouldn't bother with those.

jvanstraten · 2022-08-15T20:53:08Z

Sure;

patch (no functional changes):
- fix a typo in a description or argument name
- elaborate upon a description because people didn't understand it (which is the closest thing to a bugfix IMO)
- add names or descriptions to arguments
minor (added features):
- add an option to the end of an existing enum
- add an implementation
- convert an implementation from DISCRETE nullability to DECLARED_OUTPUT, adding information about how to deal with the new implicit implementations
major (breaking change):
- remove an implementation
- change behavior by functionally modifying the description
- add an argument (effectively, this is just removing an implementation and then adding a different implementation)

westonpace · 2022-08-15T22:27:31Z

TL:DR; Thanks. I will concede that semver would be better. The one point I disagree on is that I think adding/removing implementations should not affect function versioning. Otherwise what will happen when new implementations are added for extension types to existing functions?

fix a typo in a description or argument name
elaborate upon a description because people didn't understand it (which is the closest thing to a bugfix IMO)
add names or descriptions to arguments

I had been thinking purely of producers & consumers. None of these changes would be something they care about (since producers and consumers don't really need the YAML). If a consumer works with version 1.2.3 and version 1.2.4 fixes a description typo then there would be zero reason to upgrade the consumer.

However, this logic doesn't apply for validators, visualizers, etc. In those cases the YAML is often presented to the end user (e.g. in a validator error message or visualizer display) and so it makes sense users would want the latest version installed.

add an option to the end of an existing enum

Very good example I had not thought of.

add an implementation

In my mind implementations were kind of independent things so adding an implementation would be going from version 0 (does not exist) to whatever the version of the function you are adding is. So there would be no need to raise the version in this case.

convert an implementation from DISCRETE nullability to DECLARED_OUTPUT, adding information about how to deal with the new implicit implementations

This is an incredibly subtle and beautiful example. I hope you at least had to wrack your brain for a while to come up with it. I agree.

remove an implementation

Again, I'm not sure I agree that function versioning should be bound to implementations.

jvanstraten · 2022-08-16T07:19:04Z

The one point I disagree on is that I think adding/removing implementations should not affect function versioning. Otherwise what will happen when new implementations are added for extension types to existing functions?

Again, I'm not sure I agree that function versioning should be bound to implementations.

You lost me with these statements. I think I may be making incorrect assumptions about where these version numbers are and what they are associated with. My assumption was that a function (i.e. at the name + description + implementation list level) gets a version number, and that a function extension declaration in the plan would include a version number that the actual function version must be compatible with, but now that I'm thinking about it more thoroughly I'm not sure how that would work with the URIs. Can you explain where you think the version numbers should be?

This is an incredibly subtle and beautiful example. I hope you at least had to wrack your brain for a while to come up with it. I agree.

Coincidentally I had given a very extensive explanation about how those nullability things work and what the edge cases are to Phillip a few hours prior, so it was fresh in my mind :)

westonpace · 2022-08-17T14:47:23Z

You lost me with these statements. I think I may be making incorrect assumptions about where these version numbers are and what they are associated with. My assumption was that a function (i.e. at the name + description + implementation list level) gets a version number, and that a function extension declaration in the plan would include a version number that the actual function version must be compatible with, but now that I'm thinking about it more thoroughly I'm not sure how that would work with the URIs. Can you explain where you think the version numbers should be?

My thinking was that it would not include implementations (e.g. at the same level as name + description). One approach could be in both the YAML and the protobuf:

# In the YAML
scalar_functions:
  -
    name: "add"
    description: "Add two values."
    version: "1.0.0"
    impls:
    ...

# Later, in the protobuf

  message ExtensionFunction {
    ...
    string name = 3;

    // The version of the function to call
    string version = 4;
  }

Another approach could be in the name (which conveniently, wouldn't require a protobuf change):

scalar_functions:
  -
    name: "add-1.0.0"
    description: "Add two values."
    impls:
    ...

From a consumer perspective the "in the name" approach was going to be automatic if we were just doing a single version number ("add-2" is simply a different function, no one needs to know the "-2" part represents the version). No parsing of the version would be required. Unfortunately, this does not really work as conveniently if we are going to consider patch versions (since "add-1.0.0" and "add-1.0.1" will map to the same function implementation). So in that case I would prefer the first, more explicit, approach.

I'm not certain it makes sense to version the implementation. Either an implementation exists for the function/version pair or it doesn't. Adding a new implementation does not require a version change nor does dropping an implementation.

For example, if I want to query a consumer to know what functions/implementations it supports I would not say "Do you support add-1.0.0?" and expect "yes" to mean that the consumer supports all implementations. Instead I would say "List the supported implementations for add-1.0.0" and then inspect the list.

jvanstraten · 2022-08-17T17:20:58Z

Okay, I think I now understand what you mean. My gut feeling was to disagree and argue in favor of "Do you support add-1.0.0? -> y/n", but I guess you have a point; if a hypothetical consumer just doesn't support i64 at all for whatever reason, it would never be able to satisfy any of the core arithmetic functions (as currently defined) by that logic.

It's still conceptually very counter-intuitive to me to slap a version number on a list of things and then be able to manipulate that list (by adding to or removing from it in this case) without requiring a different version. IMO, these versions should not exclusively be a tool for a producer to determine whether it needs to do some casting before calling a function in order to be compatible, but also for people to see "has something changed," or for other things we just haven't thought of yet. Imagine a documentation tool for instance that uses these versions to determine whether it needs to regenerate something; clearly it does if an implementation is removed or added.

But I see your point -- a consumer may reasonably only implement a subset of a function's implementations -- and would personally address it by giving each implementation a version number rather than versioning a function name. In fact, I'd argue that these function names are weird to begin with. I'm struggling to even come up with a word for them. Shouldn't they just be separate functions for all intents and purposes, except for that we silly humans conceptualize them as doing roughly the same thing? That's how C handles overloads, anyway, and also kind of how Substrait does it, considering how compound names work. So I would want to avoid attaching more meaning to a group of implementations with the same name than... a group of implementations that happen to have the same name because it makes sense to humans.

When it comes to more intelligently dealing with a "no" response to "would you be compatible with the query I've concocted?" than just propagating the error onto the user, such as promoting argument types to get to an implementation that does exist, I don't think it's currently safe to assume that the function would still behave the same after the upcast just because they have the same name. You pointed that out yourself during the sync meeting after all. add(NULL, 100i8, 100i8) should yield null; if a producer would be allowed to transform that to cast<i8>(add(NULL, cast<i6>(100i8), cast<i6>(100i8))) if the consumer doesn't support add(i8, i8) but does support add(i16, i16), I imagine you'd be changing functionality. So I don't think a producer would ever be able to make those kinds of decisions without already having some understanding of what add:opt_i8_i8 and add:opt_i16_i16 do and how cast<i8>(i16) behaves. Ergo, just querying "List the supported implementations for add-1.0.0" would not be enough information in general, so I'm not sure what it adds on top of "do you support this particular implementation of add-1.0.0 y/n?" or "list all supported implementations of all functions [for this URI]", except for complexity.

I'm just piping my thoughts straight to github at this point, but what would you think of a scheme like this:

We give functions version numbers only for the English description of its functionality, not to the argument list, return type derivation expression, purity flags, etc. I'm not sure full semver is needed for this scheme; just the equivalent of a major version is probably sufficient.
A producer can query a list of all function implementations for a given YAML URI supported by the producer it's talking to, including machine-readable metadata i.e. uri -> list(tuple(compound function/implementation name, version, arg list, return type, purity, ...)). It should probably be able to yield more than one entry for each function implementation if it supports multiple versions, so perhaps uri -> map(compound function/implementation name, list(version, arg list, return type, purity, ...)) would be more descriptive of what I mean.
- There should probably be some kind of caching mechanism in the protocol so the producer doesn't have to request this every time. I would prefer that over filtering by function name, though I can't quite put my finger on why; it just seems comparatively brittle and prone to needing to go back and forth between the consumer and producer a lot, and I feel like bandwidth is a lot cheaper than latency in this case.
A producer never has to actually download and parse a YAML file, because it gets all the information it needs straight from the consumer it's talking to. So, big plus of this scheme: we can stop wracking our brains about how we should deal with broken links or someone changing the content of a YAML file when they shouldn't. The YAMLs themselves would basically just be reduced to a specification of an extension, rather than something more functional and concrete; they'd only be for people to read and for validators to validate.

In this scheme we don't need complex versioning because, rather than hiding it behind a version number, we just query all machine-readable information straight from the consumer. We only need a version number for the more "human" description of functionality, in conjunction with the function name. I actually wonder if we should bother with version numbers at all in this case; maybe we should just say that if we correct a description along the lines of "1 + 1 is not in fact 3 but 2" we should just call the function add2 instead of add. I suppose it might make error messages less descriptive if a consumer only supports a version of a function that's too old, but I think that's about it.

As for add-2 by the way, I'm rather strongly opposed to allowing things that don't look like normal identifiers for names. I don't see what it adds and I've been bitten by exactly this in a DSL before, and we do seem to want a text format eventually, after all.

westonpace · 2023-03-08T18:58:43Z

I think this is useful discussion but I don't think there is any enthusiasm for acting on this. At the end of the day I suspect we will be able to get by with versioned names (e.g. add, add2). We may even someday end up with rather bizarre functions like add_legacy_spark2 because of some bug that was present in a particular engine which users came to rely on.

So I am going to close this issue but if anyone wants to return to this discussion and make a new proposal they are welcome to do so.

jvanstraten mentioned this issue Sep 26, 2022

feat: attach Substrait version number to plans #347

Merged

westonpace mentioned this issue Nov 3, 2022

docs: add tutorial for reading plans #354

Merged

westonpace closed this as completed Mar 8, 2023

westonpace mentioned this issue Apr 18, 2024

Function names & URIs #634

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Functions backwards compatibility #274

Functions backwards compatibility #274

westonpace commented Aug 3, 2022

jvanstraten commented Aug 15, 2022

westonpace commented Aug 15, 2022

jvanstraten commented Aug 15, 2022

westonpace commented Aug 15, 2022

jvanstraten commented Aug 16, 2022

westonpace commented Aug 17, 2022

jvanstraten commented Aug 17, 2022

westonpace commented Mar 8, 2023

Functions backwards compatibility #274

Functions backwards compatibility #274

Comments

westonpace commented Aug 3, 2022

jvanstraten commented Aug 15, 2022

westonpace commented Aug 15, 2022

jvanstraten commented Aug 15, 2022

westonpace commented Aug 15, 2022

jvanstraten commented Aug 16, 2022

westonpace commented Aug 17, 2022

jvanstraten commented Aug 17, 2022

westonpace commented Mar 8, 2023