Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: attach Substrait version number to plans #347

Merged
merged 3 commits into from
Oct 5, 2022
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
37 changes: 37 additions & 0 deletions proto/substrait/plan.proto
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,10 @@ message PlanRel {
// Describe a set of operations to complete.
// For compactness sake, identifiers are normalized at the plan level.
message Plan {
// Substrait version of the plan. Optional up to 0.17.0, required for later
// versions.
Version version = 6;

// a list of yaml specifications this plan may depend on
repeated substrait.extensions.SimpleExtensionURI extension_uris = 1;

Expand All @@ -43,3 +47,36 @@ message Plan {
// one or more message types defined here are unknown.
repeated string expected_type_urls = 5;
}

// This message type can be used to deserialize only the version of a Substrait
// Plan message. This prevents deserialization errors when there were breaking
// changes between the Substrait version of the tool that produced the plan and
// the Substrait version used to deserialize it, such that a consumer can emit
// a more helpful error message in this case.
message PlanVersion {
Version version = 6;
}

message Version {
// Substrait version number.
uint32 major = 1;
uint32 minor = 2;
uint32 patch = 3;

// If a particular version of Substrait is used that does not correspond to
// a version number exactly (for example when using an unofficial fork or
// using a version that is not yet released or is between versions), set this
// to the full git hash of the utilized commit of
// https://github.com/substrait-io/substrait (or fork thereof), represented
// using a lowercase hex ASCII string 40 characters in length. The version
// number above should be set to the most recent version tag in the history
// of that commit.
string git_hash = 4;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the major/minor/path be left empty in this case (0.0.0)? Or should they be set to the most recent version before the fork? I would think empty but might be good to clarify.

Copy link
Contributor Author

@jvanstraten jvanstraten Sep 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't the last sentence of the comment cover it? I guess I could make explicit that the tag I'm referring to must be an "official" substrait-io version tag.

ETA: and I think there must always be a version tag reachable in the history, because there is a tag reachable from the commit that would add this feature in the first place.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yeah, that is clear. I think I just glossed over it reading, sorry

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to introduce what tool name and version produced the plan. In the practical (annoying) world, there may be 3 different versions of a producer that all produce using the exact same Substrait plan version but one of them produces invalid plans that have been stored. Consumers may need to be conditional about what is seen to handle bugs in producers. We've had to do this a bunch of times with Parquet's producer field and I expect the same here (sadly).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't necessarily disagree, but I fear we might end up with something like Isthmus/5.0 (X11; Linux x86_64) Acero/537.36 (KHTML, like Gecko) Velox/105.0.0.0 Ibis/537.36 if we're not careful, especially if we just add a string field for this. For consumer-specific plans (i.e. those using extensions from some particular engine), shouldn't there then also be an optional intended consumer identifier?

Typing a bit more freely I came up with this...

message Plan {
  // Metadata for the plan. Optional up to 0.17.0, required for later versions.
  Metadata metadata = 6;
  
  // ...
}

message Metadata {
  // Optional name of the plan. No semantical meaning; should only be used for
  // debugging and tracing.
  string name = 1;

  // Substrait version of the plan. Optional up to 0.17.0, required for later
  // versions.
  Version version = 2;

  // Identification of the tool generating the plan. Optional (hand-written
  // plans, for example, don't logically have a producer), but strongly
  // recommended when a plan is
  Tool producer = 3;
  
  // If the plan was generated to be compatible with one or more specific
  // consumers, this/these consumer(s) can be identified here. Optional.
  repeated Tool consumer = 4;

  // A place to store custom metadata.
  google.protobuf.Any custom = 5;
}

message Version {
  // Substrait version number.
  uint32 major = 1;
  uint32 minor = 2;
  uint32 patch = 3;

  // If a particular version of Substrait is used that does not correspond to
  // a version number exactly (for example when using an unofficial fork or
  // using a version that is not yet released or is between versions), set this
  // to the full git hash of the utilized commit of
  // https://github.com/substrait-io/substrait (or fork thereof), represented
  // using a lowercase hex ASCII string 40 characters in length. The version
  // number above should be set to the most recent version tag in the history
  // of that commit.
  string git_hash = 4;
}
  
message Tool {
  // Name of the tool, e.g. Acero, Velox, DuckDB, etc. Required.
  string name = 1;

  // Version of the tool. Put the integer components of the version in
  // version_components, and anything that's needed to identify the version in
  // addition in version_suffix. For a release version of tools using semver,
  // version_components should have 3 entries, and version_suffix should be
  // empty.
  repeated uint32 version_components = 2;
  string version_suffix = 3;
  
  // An optional place to store tool-specific information, such as feature
  // flags.
  google.protobuf.Any custom = 4;
}

... but this is all pretty arbitrary and grows without bound.

IMO there's no need to add something like this already and we should postpone this, especially considering we've already had talks of more precise communication of capabilities and negotiation. An embedded Substrait version number, on the other hand, addresses a much more pressing need: if a proposal like #333 would have been accepted (shouldn't we close that PR, since even Weston changed his mind in the end...?) there would have been no way to tell without context what a project relation means, even without accounting for consumer/producer bugs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isthmus/5.0 (X11; Linux x86_64) Acero/537.36 (KHTML, like Gecko) Velox/105.0.0.0 Ibis/537.36 if we're not careful, especially if we just add a string field for this

This is exactly what we want. Ideally, we never need to use the field. Practically, we will. We should add it now before we need it. This isn't structured information, it's a log entry to deal with unexpected bugs later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


// Identifying information for the producer that created this plan. Under
// ideal circumstances, consumers should not need this information. However,
// it is foreseen that consumers may need to work around bugs in particular
// producers in practice, and therefore may need to know which producer
// created the plan.
string producer = 5;
}