-
Notifications
You must be signed in to change notification settings - Fork 318
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
point in time proposal #2193
point in time proposal #2193
Conversation
Codecov Report
@@ Coverage Diff @@
## main #2193 +/- ##
=========================================
Coverage 76.47% 76.47%
Complexity 1113 1113
=========================================
Files 216 216
Lines 5203 5203
Branches 421 421
=========================================
Hits 3979 3979
Misses 752 752
Partials 472 472 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
49fbc96
to
85ca8ef
Compare
85ca8ef
to
1c12088
Compare
proposals/2117-marquez-over-time.md
Outdated
* `/api/v1/namespaces/some-namespace/datasets/some-dataset?snapshotAt=dataset_version:5ca3b37e-4e18-11ed-bdc3-0242ac120002` | ||
* `/api/v1/namespaces/some-namespace/jobs/some-job?snapshotAt=job_version:5ca3b37e-4e18-11ed-bdc3-0242ac120002` | ||
* `/api/v1/namespaces/some-namespace/jobs/some-job?snapshotAt=run_id:5ca3b37e-4e18-11ed-bdc3-0242ac120002` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How are these different from the /versions
variant of the APIs?
https://github.com/MarquezProject/marquez/blob/main/api/src/main/java/marquez/api/JobResource.java#L111-L131
https://github.com/MarquezProject/marquez/blob/main/api/src/main/java/marquez/api/DatasetResource.java#L101-L121
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
versions
variant of the APIs are implemented for Dataset
and Job
and this approach does not seem to be extendable to lineage
or column-lineage
endpoints. It makes sense to ask for lineage at specific run_id
or lineage of a specific dataset_version
. Lineage can be versioned by multiple params like tracking how the lineage looked like for different dataset versions or how did it look like for certain run_ids.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, then if this is particular to the /lineage
and /column-lineage
APIs, let's update the URLs here to point to those APIs rather than the /datasets
and /jobs
APIs. I don't think we need to change the existing job/dataset APIs, as they're already being used.
Currently, the /lineage
and /column-lineage
APIs accept a NodeId
as their argument. If asking for a run
, we shouldn't need any extra parameters, as the point-in-time parameter is inferred. For the job
and dataset
nodes, can we simply pass in a version
parameter? Whether by modifying the node id (e.g., job:abc@version
or something similar) or by passing in a query parameter.
|
||
## Problem | ||
|
||
Marquez data model in PostgresSQL allows extracting a snapshot of any lineage content from the past. This may be extremely |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the intention here to return lineage for a given point in time? Or just the dataset/job definitions as they were at a particular version? If the latter, don't we already have that? If the former, do you intend to store the lineage information differently? The current storage model is sufficient to return point-in-time lineage, but it's really slow if looking at anything but the latest version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intention of this PR is to make consensus on how do we expose point in time endpoints, including lineage endpoints as well. Storage model, and performance related to it, is specific to each endpoint and it's outside the scope of this PR. You're right that it may require some remodeling but I think this should be discussed one by one when implementing specific endpoint for point in time.
In other words: this PR shows how API for lineage over time should look like, but not how to implement it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with agreeing on the API changes before tackling the storage, but I do think there ought to be a design doc tackling the storage issue up front. We're really only talking about the /lineage
and /column-lineage
APIs here, so it's really a question of how to store point-in-time lineage in a scalable way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @collado-mike, to get a snapshot of a lineage graph the caller should invoke GET
/lineage
with a nodeId
(see proposed nodeId
s below). A lineage graph is modified within the context of a run. We can call these incremental accumulation of changes to the lineage graph at a given point in time (via some run R
) as a run-level graph (=lineage graph snapshot). A run-level graph is directed and consists of three node types: dataset version, job version, and run. The graph represents the relationships between dataset, job, and run metadata at a given point in time.
Graph Data Model
A run-level graph consists of the following nodes:
- Dataset Version: A read-only immutable version of a dataset.
- Job Version: A read-only immutable version of a job, with a unique referenceable link to code preserving the reproducibility of builds from source.
- Run: A discrete instantiation of a job version, with a unique run ID used to update each stage of execution.
Nodes
ID | dataset:{namespace}:{dataset}#{version} |
---|---|
Example | dataset:food_delivery:public.top_delivery_times#947c0388.. |
ID | job:{namespace}:{job}#{version} |
---|---|
Example | job:food_delivery:orders_popular_day_of_week#947c0388.. |
ID | run:{id} |
---|---|
Example | run:a03422cf.. |
Note: The
nodeID
datasetField:{namespace}:{dataset}:{field}
isn't accounted for in the run-level graph data model.
proposals/2117-marquez-over-time.md
Outdated
* `/api/v1/namespaces/some-namespace/datasets/some-dataset?snapshotAt=dataset_version:5ca3b37e-4e18-11ed-bdc3-0242ac120002` | ||
* `/api/v1/namespaces/some-namespace/jobs/some-job?snapshotAt=job_version:5ca3b37e-4e18-11ed-bdc3-0242ac120002` | ||
* `/api/v1/namespaces/some-namespace/jobs/some-job?snapshotAt=run_id:5ca3b37e-4e18-11ed-bdc3-0242ac120002` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this makes sense to me.
Are we planning to also add this to the lineage endpoint?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
even though I have approved this, you should resolve @collado-mike ' s comments first. (sorry mike I submitted before seeing your comments)
7152116
to
ec34820
Compare
Signed-off-by: Pawel Leszczynski <[email protected]>
ec34820
to
21c69d5
Compare
@wslulciuc @collado-mike I think you're right. Thank you. To sum up:
|
Looks good to me @pawel-big-lebowski. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
Signed-off-by: Pawel Leszczynski [email protected]
Problem
Proposal on extending Marquez API to be able to request historical data.
Part of : #2117
Solution
Please describe your change as it relates to the problem, or bug fix, as well as any dependencies. If your change requires a database schema migration, please describe the schema modification(s) and whether it's a backwards-incompatible or backwards-compatible change.
Checklist
CHANGELOG.md
with details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary).sql
database schema migration according to Flyway's naming convention (if relevant)