Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

point in time proposal #2193

Merged
merged 1 commit into from
Nov 22, 2022
Merged

point in time proposal #2193

merged 1 commit into from
Nov 22, 2022

Conversation

pawel-big-lebowski
Copy link
Collaborator

Signed-off-by: Pawel Leszczynski [email protected]

Problem

Proposal on extending Marquez API to be able to request historical data.

Part of : #2117

Solution

Please describe your change as it relates to the problem, or bug fix, as well as any dependencies. If your change requires a database schema migration, please describe the schema modification(s) and whether it's a backwards-incompatible or backwards-compatible change.

Note: All database schema changes require discussion. Please link the issue for context.

Checklist

  • You've signed-off your work
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • You've updated the CHANGELOG.md with details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary)
  • You've versioned your .sql database schema migration according to Flyway's naming convention (if relevant)
  • You've included a header in any source code files (if relevant)

@codecov
Copy link

codecov bot commented Oct 17, 2022

Codecov Report

Merging #2193 (21c69d5) into main (8928625) will not change coverage.
The diff coverage is n/a.

@@            Coverage Diff            @@
##               main    #2193   +/-   ##
=========================================
  Coverage     76.47%   76.47%           
  Complexity     1113     1113           
=========================================
  Files           216      216           
  Lines          5203     5203           
  Branches        421      421           
=========================================
  Hits           3979     3979           
  Misses          752      752           
  Partials        472      472           

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@pawel-big-lebowski pawel-big-lebowski marked this pull request as draft October 17, 2022 13:38
proposals/2117-marquez-over-time.md Outdated Show resolved Hide resolved
proposals/2117-marquez-over-time.md Outdated Show resolved Hide resolved
Comment on lines 19 to 21
* `/api/v1/namespaces/some-namespace/datasets/some-dataset?snapshotAt=dataset_version:5ca3b37e-4e18-11ed-bdc3-0242ac120002`
* `/api/v1/namespaces/some-namespace/jobs/some-job?snapshotAt=job_version:5ca3b37e-4e18-11ed-bdc3-0242ac120002`
* `/api/v1/namespaces/some-namespace/jobs/some-job?snapshotAt=run_id:5ca3b37e-4e18-11ed-bdc3-0242ac120002`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

versions variant of the APIs are implemented for Dataset and Job and this approach does not seem to be extendable to lineage or column-lineage endpoints. It makes sense to ask for lineage at specific run_id or lineage of a specific dataset_version. Lineage can be versioned by multiple params like tracking how the lineage looked like for different dataset versions or how did it look like for certain run_ids.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, then if this is particular to the /lineage and /column-lineage APIs, let's update the URLs here to point to those APIs rather than the /datasets and /jobs APIs. I don't think we need to change the existing job/dataset APIs, as they're already being used.

Currently, the /lineage and /column-lineage APIs accept a NodeId as their argument. If asking for a run, we shouldn't need any extra parameters, as the point-in-time parameter is inferred. For the job and dataset nodes, can we simply pass in a version parameter? Whether by modifying the node id (e.g., job:abc@version or something similar) or by passing in a query parameter.


## Problem

Marquez data model in PostgresSQL allows extracting a snapshot of any lineage content from the past. This may be extremely
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the intention here to return lineage for a given point in time? Or just the dataset/job definitions as they were at a particular version? If the latter, don't we already have that? If the former, do you intend to store the lineage information differently? The current storage model is sufficient to return point-in-time lineage, but it's really slow if looking at anything but the latest version.

Copy link
Collaborator Author

@pawel-big-lebowski pawel-big-lebowski Nov 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intention of this PR is to make consensus on how do we expose point in time endpoints, including lineage endpoints as well. Storage model, and performance related to it, is specific to each endpoint and it's outside the scope of this PR. You're right that it may require some remodeling but I think this should be discussed one by one when implementing specific endpoint for point in time.

In other words: this PR shows how API for lineage over time should look like, but not how to implement it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with agreeing on the API changes before tackling the storage, but I do think there ought to be a design doc tackling the storage issue up front. We're really only talking about the /lineage and /column-lineage APIs here, so it's really a question of how to store point-in-time lineage in a scalable way.

Copy link
Member

@wslulciuc wslulciuc Nov 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @collado-mike, to get a snapshot of a lineage graph the caller should invoke GET /lineage with a nodeId (see proposed nodeIds below). A lineage graph is modified within the context of a run. We can call these incremental accumulation of changes to the lineage graph at a given point in time (via some run R) as a run-level graph (=lineage graph snapshot). A run-level graph is directed and consists of three node types: dataset version, job version, and run. The graph represents the relationships between dataset, job, and run metadata at a given point in time.

Graph Data Model

A run-level graph consists of the following nodes:

  • Dataset Version: A read-only immutable version of a dataset.
  • Job Version: A read-only immutable version of a job, with a unique referenceable link to code preserving the reproducibility of builds from source.
  • Run: A discrete instantiation of a job version, with a unique run ID used to update each stage of execution.

Nodes

ID dataset:{namespace}:{dataset}#{version}
Example dataset:food_delivery:public.top_delivery_times#947c0388..
ID job:{namespace}:{job}#{version}
Example job:food_delivery:orders_popular_day_of_week#947c0388..
ID run:{id}
Example run:a03422cf..

Note: The nodeID datasetField:{namespace}:{dataset}:{field} isn't accounted for in the run-level graph data model.

Comment on lines 19 to 21
* `/api/v1/namespaces/some-namespace/datasets/some-dataset?snapshotAt=dataset_version:5ca3b37e-4e18-11ed-bdc3-0242ac120002`
* `/api/v1/namespaces/some-namespace/jobs/some-job?snapshotAt=job_version:5ca3b37e-4e18-11ed-bdc3-0242ac120002`
* `/api/v1/namespaces/some-namespace/jobs/some-job?snapshotAt=run_id:5ca3b37e-4e18-11ed-bdc3-0242ac120002`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this makes sense to me.
Are we planning to also add this to the lineage endpoint?

proposals/2117-marquez-over-time.md Outdated Show resolved Hide resolved
Copy link
Member

@julienledem julienledem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

even though I have approved this, you should resolve @collado-mike ' s comments first. (sorry mike I submitted before seeing your comments)

@pawel-big-lebowski pawel-big-lebowski force-pushed the proposals/query-over-time branch 2 times, most recently from 7152116 to ec34820 Compare November 22, 2022 10:22
Signed-off-by: Pawel Leszczynski <[email protected]>
@pawel-big-lebowski
Copy link
Collaborator Author

@wslulciuc @collado-mike I think you're right. Thank you.

To sum up:

  • the proposal refers only to modifying lineage and column-lineage endpoints,
  • nodeId will be extended to contain version.

@mobuchowski
Copy link
Contributor

Looks good to me @pawel-big-lebowski.

Copy link
Member

@wslulciuc wslulciuc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@wslulciuc wslulciuc merged commit 0c78cd8 into main Nov 22, 2022
@wslulciuc wslulciuc deleted the proposals/query-over-time branch November 22, 2022 15:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants