-
Notifications
You must be signed in to change notification settings - Fork 303
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add StorageDatasetFacet to spec #620
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for starting this
{ | ||
"type": "object", | ||
"properties": { | ||
"datasetProvider": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we are already in the "provider" facet, this would be "name".
We should document a list of known providers to ensure consistency. Would that apply to other things than Table Formats? Possibly a name more specific than "provider" would be appropriate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the discussion we have, term provider
seems to be kind of misleading.
Official iceberg and delta definitions are:
Iceberg is a high-performance format for huge analytic tables
Delta Lake is a storage layer that brings scalable, ACID transactions to Apache Spark and other big-data engines.
Following that, we could name it storage
of the type DatasetStorageDatasetFacet
with the fields:
storageLayer
which can beiceberg
,delta
fileFormat
which can beparquet
,orc
, etc.
Alternatively we could go with StorageLayerDatasetFacet
which is also OK but has a limited extensibility in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I modified facet into DatasetStorageDatasetFacet
as it describes better its content.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you.
The name of the field (here "storage") should be consistent with the name of the type so that makes it "StorageDatasetFacet"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comments. I've applied the changes accordingly.
"description": "Dataset provider like iceberg, delta-lake, etc.", | ||
"type": "string" | ||
}, | ||
"datasetFormat": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar comment: we should document known formats
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the description here should be more directive:
It is "iceberg" or "delta"
not "like iceberg or delta"
fcb6a0b
to
19b723e
Compare
I made few last comments. Otherwise this looks great to me. |
19b723e
to
86ddaa3
Compare
Signed-off-by: Pawel Leszczynski <[email protected]>
7003f7d
to
b0d21f7
Compare
Signed-off-by: Pawel Leszczynski [email protected]
Problem
Spark integration collects table provider information like delta or iceberg within custom facet. This information becomes useful together with a dataset version collected as version is retrieved from the provider. To make it consistent, if a version is within a global spec, the provider information should also be put there, however a more generic name should be considered like storageProvider, backendStorageProvider, storageProperties with fields: format or storageProvider
Closes: #619
Solution
Add
datasetProvider
facet to spec.If you're contributing a new integration, please specify the scope of the integration and how/where it has been tested (e.g., Apache Spark integration supports
S3
andGCS
filesystem operations, tested with AWS EMR).Checklist
CHANGELOG.md
with details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary)