point in time proposal #2193

pawel-big-lebowski · 2022-10-17T13:00:09Z

Signed-off-by: Pawel Leszczynski [email protected]

Problem

Proposal to extend the Marquez API so that historical data can be requested.

Part of: #2117

Solution

Please describe your change as it relates to the problem, or bug fix, as well as any dependencies. If your change requires a database schema migration, please describe the schema modification(s) and whether it's a backwards-incompatible or backwards-compatible change.

Note: All database schema changes require discussion. Please link the issue for context.

Checklist

You've signed-off your work
Your changes are accompanied by tests (if relevant)
Your change contains a small diff and is self-contained
You've updated any relevant documentation (if relevant)
You've updated the CHANGELOG.md with details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary)
You've versioned your .sql database schema migration according to Flyway's naming convention (if relevant)
You've included a header in any source code files (if relevant)

codecov · 2022-10-17T13:03:51Z

Codecov Report

Merging #2193 (21c69d5) into main (8928625) will not change coverage.
The diff coverage is n/a.

@@            Coverage Diff            @@
##               main    #2193   +/-   ##
=========================================
  Coverage     76.47%   76.47%           
  Complexity     1113     1113           
=========================================
  Files           216      216           
  Lines          5203     5203           
  Branches        421      421           
=========================================
  Hits           3979     3979           
  Misses          752      752           
  Partials        472      472

collado-mike · 2022-11-08T17:12:30Z

proposals/2117-marquez-over-time.md

+ * `/api/v1/namespaces/some-namespace/datasets/some-dataset?snapshotAt=dataset_version:5ca3b37e-4e18-11ed-bdc3-0242ac120002`
+ * `/api/v1/namespaces/some-namespace/jobs/some-job?snapshotAt=job_version:5ca3b37e-4e18-11ed-bdc3-0242ac120002`
+ * `/api/v1/namespaces/some-namespace/jobs/some-job?snapshotAt=run_id:5ca3b37e-4e18-11ed-bdc3-0242ac120002`


How are these different from the /versions variant of the APIs?

https://github.com/MarquezProject/marquez/blob/main/api/src/main/java/marquez/api/JobResource.java#L111-L131
https://github.com/MarquezProject/marquez/blob/main/api/src/main/java/marquez/api/DatasetResource.java#L101-L121

The /versions variants of the APIs are implemented for Dataset and Job, and this approach does not seem extendable to the lineage or column-lineage endpoints. It makes sense to ask for lineage at a specific run_id, or for the lineage of a specific dataset_version. Lineage can be versioned by multiple parameters, for example tracking how the lineage looked for different dataset versions or for certain run_ids.

Ok, then if this is particular to the /lineage and /column-lineage APIs, let's update the URLs here to point to those APIs rather than the /datasets and /jobs APIs. I don't think we need to change the existing job/dataset APIs, as they're already being used.

Currently, the /lineage and /column-lineage APIs accept a NodeId as their argument. If asking for a run, we shouldn't need any extra parameters, as the point-in-time parameter is inferred. For the job and dataset nodes, can we simply pass in a version parameter? Whether by modifying the node id (e.g., job:abc@version or something similar) or by passing in a query parameter.
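To make the two options above concrete, here is a minimal Python sketch of how a client might build each request. It assumes the lineage endpoint is reachable at `/api/v1/lineage` with a `nodeId` query parameter; the `@` separator is only the placeholder suggested in the comment above, and the `version` query parameter name is hypothetical. Neither is an agreed format.

```python
from urllib.parse import urlencode

# Assumed path of the lineage endpoint; adjust to the actual deployment.
BASE = "/api/v1/lineage"


def lineage_url_with_versioned_node_id(node_id: str, version: str) -> str:
    """Option 1: fold the version into the node id (hypothetical '@' separator)."""
    return f"{BASE}?{urlencode({'nodeId': f'{node_id}@{version}'})}"


def lineage_url_with_version_param(node_id: str, version: str) -> str:
    """Option 2: keep the node id unchanged and pass a hypothetical 'version' parameter."""
    return f"{BASE}?{urlencode({'nodeId': node_id, 'version': version})}"
```

Option 1 keeps the request to a single argument, at the cost of changing how node ids are parsed; option 2 leaves existing node ids untouched but adds a parameter that only applies to job and dataset nodes.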

collado-mike · 2022-11-08T17:14:33Z

proposals/2117-marquez-over-time.md

+
+## Problem
+
+Marquez data model in PostgresSQL allows extracting a snapshot of any lineage content from the past. This may be extremely


Is the intention here to return lineage for a given point in time? Or just the dataset/job definitions as they were at a particular version? If the latter, don't we already have that? If the former, do you intend to store the lineage information differently? The current storage model is sufficient to return point-in-time lineage, but it's really slow if looking at anything but the latest version.

The intention of this PR is to reach consensus on how we expose point-in-time endpoints, including the lineage endpoints. The storage model, and the performance related to it, is specific to each endpoint and is outside the scope of this PR. You're right that it may require some remodeling, but I think this should be discussed case by case when implementing a specific point-in-time endpoint.

In other words: this PR shows what the API for lineage over time should look like, not how to implement it.

I'm fine with agreeing on the API changes before tackling the storage, but I do think there ought to be a design doc tackling the storage issue up front. We're really only talking about the /lineage and /column-lineage APIs here, so it's really a question of how to store point-in-time lineage in a scalable way.

I agree with @collado-mike: to get a snapshot of a lineage graph, the caller should invoke GET /lineage with a nodeId (see the proposed nodeIds below). A lineage graph is modified within the context of a run. We can call this incremental accumulation of changes to the lineage graph at a given point in time (via some run R) a run-level graph (= lineage graph snapshot). A run-level graph is directed and consists of three node types: dataset version, job version, and run. The graph represents the relationships between dataset, job, and run metadata at a given point in time.

Graph Data Model

A run-level graph consists of the following nodes:

Dataset Version: A read-only immutable version of a dataset.

Job Version: A read-only immutable version of a job, with a unique referenceable link to code preserving the reproducibility of builds from source.

Run: A discrete instantiation of a job version, with a unique run ID used to update each stage of execution.
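The three node types above could be modeled roughly as follows. This is a minimal sketch for discussion, not Marquez code: the class and field names are hypothetical, and the graph only records directed edges without prescribing which direction each relationship takes.

```python
from dataclasses import dataclass, field
from typing import Set, Tuple, Union


@dataclass(frozen=True)
class DatasetVersion:
    """Immutable version of a dataset."""
    namespace: str
    name: str
    version: str


@dataclass(frozen=True)
class JobVersion:
    """Immutable version of a job."""
    namespace: str
    name: str
    version: str


@dataclass(frozen=True)
class Run:
    """A single instantiation of a job version, identified by its run ID."""
    run_id: str


Node = Union[DatasetVersion, JobVersion, Run]


@dataclass
class RunLevelGraph:
    """Directed graph over dataset versions, job versions, and runs."""
    nodes: Set[Node] = field(default_factory=set)
    edges: Set[Tuple[Node, Node]] = field(default_factory=set)

    def add_edge(self, source: Node, target: Node) -> None:
        # Adding an edge also registers both endpoints as nodes.
        self.nodes.update((source, target))
        self.edges.add((source, target))
```

Because the node classes are frozen, they are hashable and can be stored in sets, which matches the read-only, immutable nature of versions described above.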

Nodes

Dataset version ID: dataset:{namespace}:{dataset}#{version}

Example: dataset:food_delivery:public.top_delivery_times#947c0388..

Job version ID: job:{namespace}:{job}#{version}

Example: job:food_delivery:orders_popular_day_of_week#947c0388..

Run ID: run:{id}

Example: run:a03422cf..

Note: The nodeID datasetField:{namespace}:{dataset}:{field} isn't accounted for in the run-level graph data model.
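As an illustration of the proposed ID formats, a parser could look roughly like this. It is a sketch under simplifying assumptions, not the Marquez implementation: `NodeId` and `parse_node_id` are hypothetical names, the version suffix is treated as optional, and namespaces containing `:` or names containing `#` are not handled.

```python
from typing import NamedTuple, Optional


class NodeId(NamedTuple):
    kind: str                 # "dataset", "job", or "run"
    namespace: Optional[str]  # None for runs
    name: str                 # dataset/job name, or the run id
    version: Optional[str]    # None when no '#version' suffix is given


def parse_node_id(raw: str) -> NodeId:
    """Parse 'dataset:{ns}:{name}#{version}', 'job:{ns}:{name}#{version}', or 'run:{id}'."""
    kind, _, rest = raw.partition(":")
    if kind == "run":
        if not rest:
            raise ValueError(f"missing run id in {raw!r}")
        return NodeId("run", None, rest, None)
    if kind in ("dataset", "job"):
        # Split off the optional version first, then namespace and name.
        body, _, version = rest.partition("#")
        namespace, sep, name = body.partition(":")
        if not sep or not namespace or not name:
            raise ValueError(f"expected {kind}:{{namespace}}:{{name}} in {raw!r}")
        return NodeId(kind, namespace, name, version or None)
    raise ValueError(f"unsupported node type in {raw!r}")
```

A call such as `parse_node_id("job:food_delivery:orders_popular_day_of_week")` yields a `NodeId` whose `version` is `None`, which a server could interpret as "latest".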

julienledem · 2022-11-08T18:04:53Z

proposals/2117-marquez-over-time.md

+ * `/api/v1/namespaces/some-namespace/datasets/some-dataset?snapshotAt=dataset_version:5ca3b37e-4e18-11ed-bdc3-0242ac120002`
+ * `/api/v1/namespaces/some-namespace/jobs/some-job?snapshotAt=job_version:5ca3b37e-4e18-11ed-bdc3-0242ac120002`
+ * `/api/v1/namespaces/some-namespace/jobs/some-job?snapshotAt=run_id:5ca3b37e-4e18-11ed-bdc3-0242ac120002`


this makes sense to me.
Are we planning to also add this to the lineage endpoint?

julienledem

Even though I have approved this, you should resolve @collado-mike's comments first. (Sorry Mike, I submitted before seeing your comments.)

pawel-big-lebowski · 2022-11-22T10:49:28Z

@wslulciuc @collado-mike I think you're right. Thank you.

To sum up:

the proposal refers only to modifying lineage and column-lineage endpoints,
nodeId will be extended to contain version.

mobuchowski · 2022-11-22T12:54:58Z

Looks good to me @pawel-big-lebowski.

wslulciuc

LGTM 👍

boring-cyborg bot added docs proposal labels Oct 17, 2022

pawel-big-lebowski force-pushed the proposals/query-over-time branch from 49fbc96 to 85ca8ef Compare October 17, 2022 13:04

pawel-big-lebowski requested review from collado-mike, julienledem, mobuchowski and wslulciuc October 17, 2022 13:05

mobuchowski reviewed Oct 17, 2022

pawel-big-lebowski marked this pull request as draft October 17, 2022 13:38

collado-mike reviewed Oct 17, 2022

pawel-big-lebowski force-pushed the proposals/query-over-time branch from 85ca8ef to 1c12088 Compare October 18, 2022 08:43

pawel-big-lebowski marked this pull request as ready for review October 18, 2022 08:45

pawel-big-lebowski requested review from mobuchowski and collado-mike November 3, 2022 09:12

collado-mike reviewed Nov 8, 2022

julienledem approved these changes Nov 8, 2022

julienledem reviewed Nov 8, 2022

pawel-big-lebowski force-pushed the proposals/query-over-time branch 2 times, most recently from 7152116 to ec34820 Compare November 22, 2022 10:22

point in time proposal

21c69d5

Signed-off-by: Pawel Leszczynski <[email protected]>

pawel-big-lebowski force-pushed the proposals/query-over-time branch from ec34820 to 21c69d5 Compare November 22, 2022 10:37

wslulciuc mentioned this pull request Nov 22, 2022

Present lineage changes over time #1922

Open

mobuchowski approved these changes Nov 22, 2022

wslulciuc approved these changes Nov 22, 2022

wslulciuc merged commit 0c78cd8 into main Nov 22, 2022

wslulciuc deleted the proposals/query-over-time branch November 22, 2022 15:24
