Implement symlinks in Marquez #2066

Closed
pawel-big-lebowski opened this issue Aug 10, 2022 · 0 comments

pawel-big-lebowski commented Aug 10, 2022

Problem:

We need the ability to store alternative dataset names. For example, Hive datasets can be identified either by the location of their data files or by the metastore URI together with the database and table.

Solution in Spec:

SymlinksDatasetFacet -> OpenLineage/OpenLineage#936

Implementation in Marquez:

Model changes:

  • Create extra dataset_symlink table in Marquez with columns: (symlinkUid, name, namespaceUid, symlinkType)
  • Replace name field in datasets table with symlinkUid
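
The proposed model can be sketched roughly as follows. This is only an illustration using Python's built-in sqlite3 module, not the actual Marquez schema: the snake_case column names, the primary key choice, and the helpers `add_dataset` and `find_dataset` are assumptions, as is the reading that all names of one dataset share a single symlinkUid.

```python
import sqlite3
import uuid

# Hypothetical schema following the columns listed above.
SCHEMA = """
CREATE TABLE dataset_symlink (
    symlink_uuid   TEXT NOT NULL,   -- assumed: shared by all names of one dataset
    name           TEXT NOT NULL,
    namespace_uuid TEXT NOT NULL,
    symlink_type   TEXT,
    PRIMARY KEY (namespace_uuid, name)
);
CREATE TABLE datasets (
    uuid         TEXT PRIMARY KEY,
    symlink_uuid TEXT NOT NULL UNIQUE  -- replaces the former name column
);
"""


def add_dataset(conn, names):
    """Create one dataset reachable under every (namespace_uuid, name, type) in names."""
    dataset_uuid, symlink_uuid = str(uuid.uuid4()), str(uuid.uuid4())
    conn.execute("INSERT INTO datasets VALUES (?, ?)", (dataset_uuid, symlink_uuid))
    conn.executemany(
        "INSERT INTO dataset_symlink VALUES (?, ?, ?, ?)",
        [(symlink_uuid, name, ns, kind) for ns, name, kind in names],
    )
    return dataset_uuid


def find_dataset(conn, namespace_uuid, name):
    """Resolve a dataset uuid from any of its names, or return None."""
    row = conn.execute(
        "SELECT d.uuid FROM datasets d "
        "JOIN dataset_symlink s ON s.symlink_uuid = d.symlink_uuid "
        "WHERE s.namespace_uuid = ? AND s.name = ?",
        (namespace_uuid, name),
    ).fetchone()
    return row[0] if row else None
```

With this layout, a Hive table registered under both its file location and its metastore name resolves to the same dataset row regardless of which name is used for the lookup.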

Implementation follows the proposed DB changes:

First PR -> reflect current behaviour in the modified schema

  • provide migration SQL for existing instances
  • create a dataset_symlink row whenever dataset is created
  • modify the SQL queries in dataset_version_dao, etc.
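
A minimal sketch of what the migration for existing instances might do, again using sqlite3 for illustration. The legacy `datasets(uuid, namespace_uuid, name)` layout and the function `migrate` are assumptions made for this example; the real migration would be plain SQL against the Marquez database.

```python
import sqlite3
import uuid


def migrate(conn):
    """Give every existing dataset one symlink row carrying its current name."""
    conn.execute(
        "CREATE TABLE dataset_symlink ("
        " symlink_uuid TEXT NOT NULL,"
        " name TEXT NOT NULL,"
        " namespace_uuid TEXT NOT NULL,"
        " symlink_type TEXT)"
    )
    conn.execute("ALTER TABLE datasets ADD COLUMN symlink_uuid TEXT")

    legacy = conn.execute(
        "SELECT uuid, namespace_uuid, name FROM datasets"
    ).fetchall()
    for dataset_uuid, namespace_uuid, name in legacy:
        symlink_uuid = str(uuid.uuid4())
        # The type of a pre-existing name is unknown, so it is left NULL here.
        conn.execute(
            "INSERT INTO dataset_symlink VALUES (?, ?, ?, NULL)",
            (symlink_uuid, name, namespace_uuid),
        )
        conn.execute(
            "UPDATE datasets SET symlink_uuid = ? WHERE uuid = ?",
            (symlink_uuid, dataset_uuid),
        )
    # Dropping datasets.name is left out of this sketch.
    conn.commit()
    return len(legacy)
```

After this step each dataset has exactly one symlink, so lookups by name return the same results as before the schema change.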

Second PR

  • extract the symlinks facet when a new OpenLineage event is posted
  • fill dataset_symlink with multiple entries per OpenLineage event
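
The extraction step could look roughly like the sketch below. It assumes the event arrives as a parsed JSON dict and that the facet sits under `facets.symlinks.identifiers` with `namespace`, `name`, and `type` keys, as described in the linked spec proposal; the function name `extract_symlinks` and the sample values are made up for illustration.

```python
def extract_symlinks(event):
    """Map each dataset (namespace, name) in the event to its alternative names."""
    result = {}
    for dataset in event.get("inputs", []) + event.get("outputs", []):
        facet = dataset.get("facets", {}).get("symlinks", {})
        result[(dataset["namespace"], dataset["name"])] = [
            (ident["namespace"], ident["name"], ident.get("type"))
            for ident in facet.get("identifiers", [])
        ]
    return result


# Hypothetical event fragment: a Hive table named by file location,
# with its metastore identifier attached as a symlink.
sample_event = {
    "inputs": [
        {
            "namespace": "s3://bucket",
            "name": "warehouse/db/table",
            "facets": {
                "symlinks": {
                    "identifiers": [
                        {
                            "namespace": "hive://metastore:9083",
                            "name": "db.table",
                            "type": "TABLE",
                        }
                    ]
                }
            },
        }
    ],
    "outputs": [{"namespace": "s3://bucket", "name": "warehouse/db/out"}],
}
```

Each returned identifier would become an additional dataset_symlink row next to the row for the dataset's primary name; datasets without the facet simply yield an empty list.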