-
Notifications
You must be signed in to change notification settings - Fork 318
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
upstream run level lineage implementation #2658
Conversation
Signed-off-by: Julien Le Dem <[email protected]>
✅ Deploy Preview for peppy-sprite-186812 canceled.
|
TODO: |
Signed-off-by: Julien Le Dem <[email protected]>
Codecov Report
@@ Coverage Diff @@
## main #2658 +/- ##
============================================
+ Coverage 83.60% 83.76% +0.16%
- Complexity 1334 1338 +4
============================================
Files 245 247 +2
Lines 6076 6112 +36
Branches 280 281 +1
============================================
+ Hits 5080 5120 +40
+ Misses 844 843 -1
+ Partials 152 149 -3
📣 Codecov offers a browser extension for seamless coverage viewing on GitHub. Try it in Chrome or Firefox today! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another PR with a moderate size and huge feature contained 👍
First-string 💯
0 AS depth | ||
FROM runs r | ||
LEFT JOIN runs_input_mapping rim ON rim.run_uuid = r.uuid | ||
LEFT JOIN dataset_versions dv ON dv.uuid = rim.dataset_version_uuid |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just thinking loudly. Wouldn't be better to join dataset_versions after the recursion at the bottom of the query once all the runs are identified?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought of that as well but we actually need dataset_versions in the recursion because this is where we find the run_uuid that produced the DV for the next iteration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
idea: we could add that run uuid to runs_input_mapping at the same time as the dataset_version that would allow to join just on that table in the recursion. That'd be neat.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't get the idea: do you want to add dataset_versions to runs_input_mapping or run_uuid to dataset_versions?
Anyway, the current state of the query looks good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is we we could add the dataset_versions.run_uuid to the runs_input_mapping table.
The run_uuid is always already set in the dataset_versions table when we write the dataset_versions.uuid to the runs_input_mapping table and it never changes since it is the run that created that dataset version.
Then you just need to recursively join the runs_input_mapping over itself
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but I would keep that for future improvement.
dv2.run_uuid, | ||
ur.depth + 1 AS depth | ||
FROM upstream_runs ur | ||
INNER JOIN runs r2 ON r2.uuid = ur.u_r_uuid |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
runs
table can get huge. Is it doable to omit it in recursion step?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is where I find the job information (r2.job_uuid, r2.job_version_uuid, r2.namespace_name, r2.job_name). Do you think I could get it a different way?
unless we add it to runs_input_mapping?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I think I could actually move the runs join out of the recursion into the final select statement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just updated the query to move the joins on the runs
table outside of the recursion and in the final select statement.
public Response getRunLineageUpstream( | ||
@QueryParam("runId") @NotNull RunId runId, | ||
@QueryParam("depth") @DefaultValue(DEFAULT_DEPTH) int depth, | ||
@QueryParam("facets") String facets) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we pulling this in? I don't see it being used in the service. Did you intend to do something in the TODO block you left in there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, my idea is to be able to select what facets to return for each dataset_version job_version and run in the result.
I can remove it for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed this parameter in this iteration
Signed-off-by: Julien Le Dem <[email protected]>
Signed-off-by: Julien Le Dem <[email protected]>
-- present the result: use Distinct as we may have traversed the same edge multiple times if there are diamonds in the graph. | ||
SELECT DISTINCT ON (upstream_runs.r_uuid, upstream_runs.dataset_version_uuid, upstream_runs.u_r_uuid) | ||
upstream_runs.*, | ||
-- we add the run information after the recursion so that we join with the large run table only once |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
The overall idea looks good to me. Once the tests are created, I would be happy to approve it. |
Signed-off-by: Julien Le Dem <[email protected]>
@pawel-big-lebowski: done. |
I also tested this manually on actual production data for performance and it did well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me.
Please look at the comments and let me know do you think.
I am happy to approve PR afterwards.
@Consumes(APPLICATION_JSON) | ||
@Produces(APPLICATION_JSON) | ||
@Path("/runlineage/upstream") | ||
public Response getRunLineageUpstream( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please mind documenting this in openapi.spec
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
"COMPLETE", | ||
jobFacet, | ||
Arrays.asList(), | ||
Arrays.asList(dataset)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to add input dataset here read from some other job?
Currently recursion is not tested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
r.job_uuid, r.job_version_uuid, r.namespace_name as job_namespace, r.job_name | ||
FROM upstream_runs, runs r WHERE upstream_runs.r_uuid = r.uuid | ||
) sub | ||
ORDER BY depth ASC; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to ORDER BY depth ASC, r.job_name ASC;
Tests rely on the ordering of the query output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good idea
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Signed-off-by: Julien Le Dem <[email protected]>
Signed-off-by: Julien Le Dem <[email protected]>
Problem
When trouble shooting an issue and doing root cause analysis, it is usefull to get the upstream run-level lineage to know exactly what version of each job and dataset a run is depending on.
Solution
return the upstream run level lineage.
Example:
http://localhost:3000/api/v1/runlineage/upstream?runId=adc8507c-595e-4d76-9dac-be2bf0ffe1ee
Checklist
CHANGELOG.md
(Depending on the change, this may not be necessary)..sql
database schema migration according to Flyway's naming convention (if relevant)