add StorageDatasetFacet to spec #620

pawel-big-lebowski · 2022-03-21T13:29:35Z

Signed-off-by: Pawel Leszczynski [email protected]

Problem

The Spark integration collects table provider information, such as delta or iceberg, in a custom facet. This information is most useful together with the dataset version, because the version is retrieved from the provider. For consistency, if the version is part of the global spec, the provider information should be placed there too. A more generic name should be considered, however, such as storageProvider, backendStorageProvider, or storageProperties, with fields format or storageProvider.

Closes: #619

Solution

Add datasetProvider facet to spec.

Note: All schema changes require discussion. Please link the issue for context.

Your change modifies the core OpenLineage model
Your change modifies one or more OpenLineage facets

If you're contributing a new integration, please specify the scope of the integration and how/where it has been tested (e.g., Apache Spark integration supports S3 and GCS filesystem operations, tested with AWS EMR).

Checklist

[] You've signed-off your work
[] Your pull request title follows our guidelines
Your changes are accompanied by tests (if relevant)
Your change contains a small diff and is self-contained
You've updated any relevant documentation (if relevant)
You've updated the CHANGELOG.md with details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary)
You've versioned the core OpenLineage model or facets according to SchemaVer (if relevant)

julienledem

Thanks for starting this

julienledem · 2022-03-22T00:39:03Z

spec/facets/DatasetProviderDatasetFacet.json

+        {
+          "type": "object",
+          "properties": {
+            "datasetProvider": {


Since we are already in the "provider" facet, this field would simply be "name".
We should document a list of known providers to ensure consistency. Would this apply to things other than table formats? A name more specific than "provider" may be appropriate.

Given the discussion so far, the term "provider" seems somewhat misleading.
The official Iceberg and Delta definitions are:

Iceberg is a high-performance format for huge analytic tables

Delta Lake is a storage layer that brings scalable, ACID transactions to Apache Spark and other big-data engines.

Following that, we could name the field storage, with the type DatasetStorageDatasetFacet and the fields:

storageLayer, which can be iceberg or delta

fileFormat, which can be parquet, orc, etc.

Alternatively, we could go with StorageLayerDatasetFacet, which is also OK but offers limited extensibility in the future.

I renamed the facet to DatasetStorageDatasetFacet, as that better describes its content.

Thank you.
The name of the field (here "storage") should be consistent with the name of the type, which makes it "StorageDatasetFacet".

Thanks for the comments. I've applied the changes accordingly.
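
For illustration, the naming settled on in this thread could produce a dataset entry shaped roughly like the sketch below. The facet key storage and the field names storageLayer and fileFormat come from the discussion above; the helper function, the dataset namespace and name, and the choice to treat fileFormat as optional are assumptions made for this example, and any base facet fields the spec requires are left out.

```python
import json


def storage_facet(storage_layer, file_format=None):
    """Hypothetical helper: build a dict shaped like the facet discussed above."""
    facet = {"storageLayer": storage_layer}
    # Assumption: fileFormat is optional, so it is only set when provided.
    if file_format is not None:
        facet["fileFormat"] = file_format
    return facet


# Hypothetical dataset; the namespace and name are placeholders.
dataset = {
    "namespace": "s3://example-bucket",
    "name": "warehouse/orders",
    "facets": {
        # The key "storage" mirrors the type name StorageDatasetFacet.
        "storage": storage_facet("iceberg", "parquet"),
    },
}

print(json.dumps(dataset, indent=2))
```

Keeping the facet key aligned with the type name is the consistency point raised in the review.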

julienledem · 2022-03-22T00:39:42Z

spec/facets/DatasetProviderDatasetFacet.json

+              "description": "Dataset provider like iceberg, delta-lake, etc.",
+              "type": "string"
+            },
+            "datasetFormat": {


Similar comment: we should document known formats

Add to https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md perhaps?

I think the description here should be more directive:
It is "iceberg" or "delta"
not "like iceberg or delta"
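
A directive description implies a fixed vocabulary that producers can be checked against. The sketch below is one possible way to do that, not part of the spec: the two sets contain only the values named in this thread, the field names storageLayer and fileFormat are the ones proposed earlier in the discussion, and the function name is hypothetical.

```python
# Hypothetical lists: only the values named in this discussion, not an official registry.
KNOWN_STORAGE_LAYERS = {"iceberg", "delta"}
KNOWN_FILE_FORMATS = {"parquet", "orc"}


def unknown_values(facet):
    """Return (field, value) pairs whose values are not in the known lists."""
    problems = []
    layer = facet.get("storageLayer")
    if layer not in KNOWN_STORAGE_LAYERS:
        problems.append(("storageLayer", layer))
    # Assumption: fileFormat may be absent, so it is only checked when present.
    fmt = facet.get("fileFormat")
    if fmt is not None and fmt not in KNOWN_FILE_FORMATS:
        problems.append(("fileFormat", fmt))
    return problems


# "delta-lake" is flagged because the directive spelling is "delta".
print(unknown_values({"storageLayer": "delta-lake", "fileFormat": "parquet"}))
```

A check like this surfaces spelling variants such as "delta-lake" versus "delta", which is the inconsistency a documented list is meant to prevent.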

julienledem · 2022-03-29T16:02:38Z

I made a few last comments. Otherwise this looks great to me.

Signed-off-by: Pawel Leszczynski <[email protected]>

julienledem reviewed Mar 22, 2022

pawel-big-lebowski force-pushed the spec/dataset-provider-facet branch 3 times, most recently from fcb6a0b to 19b723e Compare March 28, 2022 08:47

pawel-big-lebowski force-pushed the spec/dataset-provider-facet branch from 19b723e to 86ddaa3 Compare March 30, 2022 11:10

pawel-big-lebowski changed the title ~~add dataset provider facet~~ add StorageDatasetFacet to spec Mar 30, 2022

add dataset provider facet

b0d21f7

Signed-off-by: Pawel Leszczynski <[email protected]>

pawel-big-lebowski force-pushed the spec/dataset-provider-facet branch from 7003f7d to b0d21f7 Compare March 30, 2022 11:14

mobuchowski self-requested a review March 30, 2022 11:17

mobuchowski approved these changes Mar 30, 2022

collado-mike approved these changes Mar 31, 2022

Merge branch 'main' into spec/dataset-provider-facet

be70edf

pawel-big-lebowski merged commit 431251d into main Mar 31, 2022

pawel-big-lebowski deleted the spec/dataset-provider-facet branch March 31, 2022 09:40
