Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal/2071 update version versioning #2153

Merged

Conversation

RNHTTR
Copy link
Contributor

@RNHTTR RNHTTR commented Sep 30, 2022

Overview

This PR formally introduces Proposal: Update DatasetVersion versioning #2071

Checklist

  • You've signed-off your work
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • You've updated the CHANGELOG.md with details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary)
  • You've versioned your .sql database schema migration according to Flyway's naming convention (if relevant)
  • You've included a header in any source code files (if relevant)

@RNHTTR
Copy link
Contributor Author

RNHTTR commented Sep 30, 2022

I could use some help with a couple things here:

  1. Confirming the proposed db changes are backward compatible
  2. Added details for implementing the facet. I assume it'll be part of the DatasetVersionDAO, but I'm not entirely sure.
  3. Authoring db migration

@codecov
Copy link

codecov bot commented Sep 30, 2022

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (6eb0ee6) 83.76% compared to head (26175f9) 75.78%.

❗ Current head 26175f9 differs from pull request most recent head 8c88a32. Consider uploading reports for the commit 8c88a32 to get more accurate results

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2153      +/-   ##
============================================
- Coverage     83.76%   75.78%   -7.99%     
+ Complexity     1338     1061     -277     
============================================
  Files           247      209      -38     
  Lines          6112     5006    -1106     
  Branches        281      403     +122     
============================================
- Hits           5120     3794    -1326     
+ Misses          843      763      -80     
- Partials        149      449     +300     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@RNHTTR
Copy link
Contributor Author

RNHTTR commented Oct 10, 2022

FYI @collado-mike @wslulciuc

## Proposal
This proposal aims to simplify the Marquez versioning system and enable new features to support dataset versions supplied by external data systems.

Moving forward, `DatasetVersionRow` and its cousin [`ExtendedDatasetVersionRow`](https://github.com/MarquezProject/marquez/blob/0.26.0/api/src/main/java/marquez/db/models/ExtendedDatasetVersionRow.java#L26) will no longer support a `version` field; rather, this will be dropped and replaced by a nullable `externalVersion`. This `externalVersion` field will be populated if provided, using [OpenLineage's DatasetVersionDatasetFacet facet](https://github.com/OpenLineage/OpenLineage/blob/main/spec/facets/DatasetVersionDatasetFacet.json) to support external dataset versions. `DatasetVersionRow`'s `uuid` field will serve as the version's sole identifier.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you comment on how the public API is impacted (if at all)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not from a user's perspective as far as I can tell, although there is another JobVersion in the public API that will need the version field removed. I'll also need to update the various versions in the clients. I'll add all of these to the list of TODOs in the implementation section.

Copy link
Contributor Author

@RNHTTR RNHTTR Oct 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I guess, while the public API will be backward compatible, we'll want to expose externalVersion to the public API as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the proposal

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking specifically about how the URLs for fetching DatasetVersions will work and how we return references to the DatasetVersion. E.g., this structure in the spec references a UUID version, but I don't know if that is going to change or if it will just reference the DatasetVersion's primary key. So what is returned in the Run spec's inputDatasetVersions field?
Additionally, the getDatasetVersion spec accepts a version parameter. Is that the UUID? or the externalVersion?
Personally, I prefer the externalVersion, if present, as it introduces compatibility between systems - e.g., a Snowflake dashboard can just know how to construct a link to the Marquez UI for a given dataset@version because the standard is known and consistent.

Copy link
Member

@wslulciuc wslulciuc Nov 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@collado-mike: very good points on backwards compatibility. To ensure we don't introduce any breaking changes, Marquez will continue to reference the uuid of dataset_versions and set that value accordingly. That is, in the case of Datasets.currentVersion, the value will continue to be of type uuid, but with the addition of Datasets.currentExternalVersion (we can think of this as an alias). More specifically, for DatasetVersionId, we will add Datasets.externalVersion, which can be null, but Datasets.version will always be set.

I think retrieving a dataset version is less obvious, but we can update the endpoint:

GET /namespaces/{namespace}/datasets/{dataset}/versions/{version}

where {version} can either be a uuid or a string type. The parsing on the backend really isn't that complicated and be easy enough to do.

Another option would be we expose a new endpoint, but that seems unnecessary and would prefer being flexible on {version} type:

GET /namespaces/{namespace}/datasets/{dataset}/versions/{version}/external

/cc @RNHTTR

@RNHTTR RNHTTR force-pushed the proposal/2071-update-version-versioning branch from c7dab4f to 26175f9 Compare October 13, 2022 15:13
@collado-mike collado-mike mentioned this pull request Oct 17, 2022
7 tasks
Copy link
Member

@wslulciuc wslulciuc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great proposal, @RNHTTR! I'm going to merge as this PR has been around for a while (really my fault here 😅). I've responded with my thoughts on backwards compatibility, we can follow up, but feel what you outlined for versioning is in a state where someone can pick up the work.

Copy link

netlify bot commented Nov 14, 2023

Deploy Preview for peppy-sprite-186812 canceled.

Name Link
🔨 Latest commit 8c88a32
🔍 Latest deploy log https://app.netlify.com/sites/peppy-sprite-186812/deploys/6553b618586262000883faeb

@wslulciuc wslulciuc enabled auto-merge (squash) November 14, 2023 18:02
@wslulciuc wslulciuc merged commit 6461a96 into MarquezProject:main Nov 14, 2023
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants