-
Notifications
You must be signed in to change notification settings - Fork 318
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposal/2071 update version versioning #2153
Proposal/2071 update version versioning #2153
Conversation
Signed-off-by: Ryan Hatter <[email protected]>
I could use some help with a couple things here:
|
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #2153 +/- ##
============================================
- Coverage 83.76% 75.78% -7.99%
+ Complexity 1338 1061 -277
============================================
Files 247 209 -38
Lines 6112 5006 -1106
Branches 281 403 +122
============================================
- Hits 5120 3794 -1326
+ Misses 843 763 -80
- Partials 149 449 +300 ☔ View full report in Codecov by Sentry. |
## Proposal | ||
This proposal aims to simplify the Marquez versioning system and enable new features to support dataset versions supplied by external data systems. | ||
|
||
Moving forward, `DatasetVersionRow` and its cousin [`ExtendedDatasetVersionRow`](https://github.com/MarquezProject/marquez/blob/0.26.0/api/src/main/java/marquez/db/models/ExtendedDatasetVersionRow.java#L26) will no longer support a `version` field; rather, this will be dropped and replaced by a nullable `externalVersion`. This `externalVersion` field will be populated if provided, using [OpenLineage's DatasetVersionDatasetFacet facet](https://github.com/OpenLineage/OpenLineage/blob/main/spec/facets/DatasetVersionDatasetFacet.json) to support external dataset versions. `DatasetVersionRow`'s `uuid` field will serve as the version's sole identifier. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you comment on how the public API is impacted (if at all)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not from a user's perspective as far as I can tell, although there is another JobVersion
in the public API that will need the version
field removed. I'll also need to update the various versions in the clients. I'll add all of these to the list of TODOs in the implementation section.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I guess, while the public API will be backward compatible, we'll want to expose externalVersion
to the public API as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the proposal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking specifically about how the URLs for fetching DatasetVersions will work and how we return references to the DatasetVersion. E.g., this structure in the spec references a UUID version, but I don't know if that is going to change or if it will just reference the DatasetVersion's primary key. So what is returned in the Run
spec's inputDatasetVersions
field?
Additionally, the getDatasetVersion
spec accepts a version
parameter. Is that the UUID? or the externalVersion
?
Personally, I prefer the externalVersion
, if present, as it introduces compatibility between systems - e.g., a Snowflake dashboard can just know how to construct a link to the Marquez UI for a given dataset@version because the standard is known and consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@collado-mike: very good points on backwards compatibility. To ensure we don't introduce any breaking changes, Marquez will continue to reference the uuid
of dataset_versions
and set that value accordingly. That is, in the case of Datasets.currentVersion
, the value will continue to be of type uuid
, but with the addition of Datasets.currentExternalVersion
(we can think of this as an alias). More specifically, for DatasetVersionId
, we will add Datasets.externalVersion
, which can be null
, but Datasets.version
will always be set.
I think retrieving a dataset version is less obvious, but we can update the endpoint:
GET /namespaces/{namespace}/datasets/{dataset}/versions/{version}
where {version}
can either be a uuid
or a string
type. The parsing on the backend really isn't that complicated and be easy enough to do.
Another option would be we expose a new endpoint, but that seems unnecessary and would prefer being flexible on {version}
type:
GET /namespaces/{namespace}/datasets/{dataset}/versions/{version}/external
/cc @RNHTTR
Signed-off-by: Ryan Hatter <[email protected]>
c7dab4f
to
26175f9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great proposal, @RNHTTR! I'm going to merge as this PR has been around for a while (really my fault here 😅). I've responded with my thoughts on backwards compatibility, we can follow up, but feel what you outlined for versioning is in a state where someone can pick up the work.
✅ Deploy Preview for peppy-sprite-186812 canceled.
|
Overview
This PR formally introduces Proposal: Update DatasetVersion versioning #2071
Checklist
CHANGELOG.md
with details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary).sql
database schema migration according to Flyway's naming convention (if relevant)