Only first job context is taken into consideration. #2230

Closed
JDarDagran opened this issue Nov 4, 2022 · 3 comments · Fixed by #2373

@JDarDagran
Contributor

Problem

Job context is a structure that holds the code location and SQL so they can be shown in the Marquez UI. On conflict, the job context upsert only considers the checksum of the context's body. This means that when the context differs between the start and the end of a job, there will be two different entries in the job_context table for that job. That might still be OK, but as a result the API exposes only the first captured context: if you don't send SqlJobFacet in the START event, you won't see it even if you send it in the COMPLETE event.
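To make the symptom concrete, here is a minimal in-memory sketch of the behaviour described above. It is not Marquez code: the dictionaries stand in for the job_context table and the run's context reference, and the function names (`upsert_job_context`, `upsert_run`, `context_for_run`) are hypothetical. The "first reference wins" step is an assumption made to reproduce what the API shows.

```python
import hashlib
import json

# Hypothetical stand-ins for the database tables (not the real schema).
job_contexts = {}  # checksum -> context body
run_context = {}   # run id -> checksum of the context the run points at


def checksum(context: dict) -> str:
    """Checksum of the context body, used as the conflict key."""
    body = json.dumps(context, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def upsert_job_context(context: dict) -> str:
    """Insert the context unless one with the same checksum already exists."""
    key = checksum(context)
    job_contexts.setdefault(key, context)
    return key


def upsert_run(run_id: str, context: dict) -> None:
    """Assumption: the run keeps the first context reference it was given."""
    key = upsert_job_context(context)
    run_context.setdefault(run_id, key)


def context_for_run(run_id: str) -> dict:
    """What an API read would return for the run in this sketch."""
    return job_contexts[run_context[run_id]]


# START event without SQL, COMPLETE event with SQL, same run.
upsert_run("run-1", {})
upsert_run("run-1", {"sql": "SELECT 1"})

print(len(job_contexts))         # two distinct context rows
print(context_for_run("run-1"))  # still the empty context from START
```

In this sketch the SQL sent with COMPLETE is stored, but nothing reads it, which matches the problem as reported.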

Solutions

I see a couple of ways to solve this problem:

  1. Update job_context_uuid when upserting into the runs table. Only the most recent context would then be exposed, which might be acceptable but probably is not.
  2. Add some custom logic to merge arrays when the contexts relate to the same run (or job?).
  3. Merge contexts in the API. This would change the run <--> job_context relation to 1-to-many.
  4. Change the structure of the job_contexts table: replace the context column with the following three: code_location_type, code_location_url, and sql, which would be filled on upsert. Some concatenation would probably still be needed.
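
As a rough illustration of option 4, the sketch below fills the three proposed columns on upsert and keeps an existing value whenever the incoming one is missing (similar in spirit to a SQL COALESCE). The `JobContextRow` class, the `upsert` function, and the example URL are hypothetical and only illustrate the idea; the concatenation mentioned above is not covered.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobContextRow:
    """Hypothetical row with the three columns proposed in option 4."""
    code_location_type: Optional[str] = None
    code_location_url: Optional[str] = None
    sql: Optional[str] = None


def _coalesce(incoming: Optional[str], existing: Optional[str]) -> Optional[str]:
    """Prefer the incoming value; fall back to the existing one if it is None."""
    return incoming if incoming is not None else existing


def upsert(existing: JobContextRow, incoming: JobContextRow) -> JobContextRow:
    """Fill each column independently so later events only add information."""
    return JobContextRow(
        code_location_type=_coalesce(incoming.code_location_type, existing.code_location_type),
        code_location_url=_coalesce(incoming.code_location_url, existing.code_location_url),
        sql=_coalesce(incoming.sql, existing.sql),
    )


# START carries the code location, COMPLETE carries the SQL.
start = JobContextRow(code_location_type="git", code_location_url="https://example.com/repo")
complete = JobContextRow(sql="SELECT 1")
row = upsert(start, complete)
print(row)
```

With per-column filling, the SQL from the COMPLETE event no longer depends on what the START event contained.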
@JDarDagran
Contributor Author

Another solution, proposed by @wslulciuc, is to use facets directly and deprecate job_context.

@mobuchowski
Contributor

I agree, those features can be fetched from facets directly. If there are no other uses for job_contexts, a lot of code could be removed.

@wslulciuc
Member

wslulciuc commented Nov 22, 2022

(Sorry for the late reply, @JDarDagran.) The jobs.context has been deprecated in favor of using job facets defined by OpenLineage (see #910). Before OpenLineage, Marquez used key/value string pairs to annotate datasets, jobs, etc. with custom metadata (think model extensibility). With the introduction/adoption of the standard, we no longer need to maintain our own extensibility model but can instead leverage the extensibility model defined by OpenLineage. That said, we would still need "merging" logic for facets when querying the discrete facet tables defined in #2152.
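
One possible shape for that "merging" logic, sketched in Python under stated assumptions: events for a run are processed in event-time order, facets are keyed by name, and a later facet with the same name replaces the earlier one. The `merge_job_facets` function and the simplified event dictionaries are hypothetical and are not the queries or schema from #2152.

```python
def merge_job_facets(events: list) -> dict:
    """Combine job facets from all events of a run; later events win per facet name."""
    merged = {}
    for event in events:  # assumed to be ordered by event time
        facets = event.get("job", {}).get("facets", {})
        merged.update(facets)
    return merged


# Simplified events: START carries only the code location, COMPLETE adds the SQL.
events = [
    {
        "eventType": "START",
        "job": {"facets": {"sourceCodeLocation": {"type": "git", "url": "https://example.com/repo"}}},
    },
    {
        "eventType": "COMPLETE",
        "job": {"facets": {"sql": {"query": "SELECT 1"}}},
    },
]

facets = merge_job_facets(events)
print(sorted(facets))
```

Merging per facet name means a facet sent only in COMPLETE is still returned, which addresses the original report without relying on job_context.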
