add StorageDatasetFacet to spec #620

Merged
merged 2 commits into from
Mar 31, 2022

Conversation

pawel-big-lebowski
Collaborator

Signed-off-by: Pawel Leszczynski [email protected]

Problem

The Spark integration collects table provider information, such as delta or iceberg, in a custom facet. This information is most useful together with the dataset version, because the version is retrieved from the provider. For consistency, if the version is part of the global spec, the provider information should be there too. However, a more generic name should be considered, such as storageProvider, backendStorageProvider, or storageProperties with the fields format or storageProvider.

Closes: #619

Solution

Add a datasetProvider facet to the spec.

Note: All schema changes require discussion. Please link the issue for context.

  • Your change modifies the core OpenLineage model
  • Your change modifies one or more OpenLineage facets

If you're contributing a new integration, please specify the scope of the integration and how/where it has been tested (e.g., Apache Spark integration supports S3 and GCS filesystem operations, tested with AWS EMR).

Checklist

  • [] You've signed-off your work
  • [] Your pull request title follows our guidelines
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • You've updated the CHANGELOG.md with details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary)
  • You've versioned the core OpenLineage model or facets according to SchemaVer (if relevant)

Member

@julienledem julienledem left a comment

Thanks for starting this

spec/facets/DatasetProviderDatasetFacet.json (outdated)
{
"type": "object",
"properties": {
"datasetProvider": {
Member

Since we are already in the "provider" facet, this would be "name".
We should document a list of known providers to ensure consistency. Would that apply to other things than Table Formats? Possibly a name more specific than "provider" would be appropriate.

Collaborator Author

Given our discussion, the term provider seems somewhat misleading.
The official Iceberg and Delta definitions are:

  • Iceberg is a high-performance format for huge analytic tables
  • Delta Lake is a storage layer that brings scalable, ACID transactions to Apache Spark and other big-data engines.

Following that, we could name it storage of the type DatasetStorageDatasetFacet with the fields:

  • storageLayer which can be iceberg, delta
  • fileFormat which can be parquet, orc, etc.

Alternatively, we could go with StorageLayerDatasetFacet, which is also OK but would be harder to extend in the future.
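
As a rough illustration of the proposal above, the facet could sit under a storage key with the two suggested fields. This is only a sketch: the helper name build_storage_facet is hypothetical, treating fileFormat as optional is an assumption, and the payload leaves out any other fields the final schema may require.

```python
import json
from typing import Optional


def build_storage_facet(storage_layer: str, file_format: Optional[str] = None) -> dict:
    """Build a dict shaped like the proposed storage facet (hypothetical helper)."""
    facet = {"storageLayer": storage_layer}
    # Assumption of this sketch: fileFormat may be omitted when it is unknown.
    if file_format is not None:
        facet["fileFormat"] = file_format
    # The facet is keyed by the field name "storage" suggested in the comment.
    return {"storage": facet}


if __name__ == "__main__":
    print(json.dumps(build_storage_facet("iceberg", "parquet"), indent=2))
```

Keeping storageLayer and fileFormat as separate fields is what leaves room for later additions, which is the extensibility argument made against a narrower StorageLayerDatasetFacet.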

Collaborator Author

I renamed the facet to DatasetStorageDatasetFacet, as it better describes its content.

Member

Thank you.
The name of the field (here "storage") should be consistent with the name of the type, which makes it "StorageDatasetFacet".

Collaborator Author

Thanks for the comments. I've applied the changes accordingly.

"description": "Dataset provider like iceberg, delta-lake, etc.",
"type": "string"
},
"datasetFormat": {
Member

Similar comment: we should document known formats

Member

@julienledem julienledem Mar 29, 2022

I think the description here should be more directive:
It is "iceberg" or "delta"
not "like iceberg or delta"
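
One way a producer could hold itself to such a directive description is to check values against a documented list before emitting the facet. The sketch below assumes the lists contain only the values named in this thread (the thread itself says "etc." for file formats, so they are not complete); the constant and function names are hypothetical and not part of the spec.

```python
from typing import List, Optional

# Values taken from this conversation only; a real list would live in the spec docs.
KNOWN_STORAGE_LAYERS = frozenset({"iceberg", "delta"})
KNOWN_FILE_FORMATS = frozenset({"parquet", "orc"})


def unknown_storage_values(storage_layer: str, file_format: Optional[str] = None) -> List[str]:
    """Return a description of each value that is not on the known lists."""
    problems = []
    if storage_layer not in KNOWN_STORAGE_LAYERS:
        problems.append(f"storageLayer: {storage_layer!r}")
    # fileFormat is only checked when it is provided (assumed optional here).
    if file_format is not None and file_format not in KNOWN_FILE_FORMATS:
        problems.append(f"fileFormat: {file_format!r}")
    return problems


if __name__ == "__main__":
    print(unknown_storage_values("delta-lake", "parquet"))
```

With an exact list, a value such as "delta-lake" is flagged instead of silently coexisting with "delta", which is the consistency concern raised earlier in the review.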

@pawel-big-lebowski pawel-big-lebowski force-pushed the spec/dataset-provider-facet branch 3 times, most recently from fcb6a0b to 19b723e on March 28, 2022 08:47
@julienledem
Member

I made a few last comments. Otherwise this looks great to me.

@pawel-big-lebowski pawel-big-lebowski changed the title from "add dataset provider facet" to "add StorageDatasetFacet to spec" on Mar 30, 2022
Signed-off-by: Pawel Leszczynski <[email protected]>
@pawel-big-lebowski pawel-big-lebowski merged commit 431251d into main Mar 31, 2022
@pawel-big-lebowski pawel-big-lebowski deleted the spec/dataset-provider-facet branch March 31, 2022 09:40
Development

Successfully merging this pull request may close these issues.

[SPEC] Introduce provider facet in global Spec
4 participants