
How should metadata be written in a partitioned dataset? #79

Open
kylebarron opened this issue Apr 22, 2022 · 5 comments

@kylebarron (Collaborator)

So far the spec has only covered single-file Parquet data. However, Parquet also supports saving as a "dataset", where there are several Parquet files in a folder structure. In this case, how should geospatial metadata be stored? There's a Parquet best practice of writing _common_metadata and _metadata sidecar files to the root of the folder structure, but that's not part of the actual Parquet specification.

If I understand correctly, the geo metadata would automatically be included in the _common_metadata file, and statistics would additionally be stored in the _metadata file, which is relevant for #13.

Should this be part of the geoparquet spec? Should it be a "best practice" that we document?

@cholmes cholmes added this to the 0.4 milestone Apr 22, 2022
@kylebarron (Collaborator, Author)

Adding here for discussion from #101 (comment); we should clarify what the bounding box represents.

Is it possible for each file's metadata to contain only its own bounding box, while the _metadata file contains the bounding box of the entire dataset? Or should every file carry the same bounding box, representing the entire dataset?
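One way to reconcile the two readings: each fragment records only its own bbox, and the writer unions them into a dataset-level bbox for the sidecar. A sketch of the union step, with illustrative per-file geo payloads and bbox values (not from any real dataset):

```python
def union_bbox(bboxes):
    """Union a list of [xmin, ymin, xmax, ymax] boxes into one box."""
    xmins, ymins, xmaxs, ymaxs = zip(*bboxes)
    return [min(xmins), min(ymins), max(xmaxs), max(ymaxs)]

# Per-file bboxes as they might appear in each fragment's "geo" footer
# metadata (hypothetical values for illustration).
per_file_geo = [
    {"columns": {"geometry": {"bbox": [0.0, 0.0, 5.0, 5.0]}}},
    {"columns": {"geometry": {"bbox": [4.0, -2.0, 9.0, 3.0]}}},
]

dataset_bbox = union_bbox(
    [g["columns"]["geometry"]["bbox"] for g in per_file_geo]
)
print(dataset_bbox)  # [0.0, -2.0, 9.0, 5.0]
```

In real use the per-file payloads would come from each fragment's footer (e.g. pyarrow's FileMetaData.metadata), not from an in-memory list.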

@cholmes (Member) commented Oct 24, 2022

Where do we stand on this in relation to the latest discussion pushing for 1.0.0-beta.1 sooner rather than later? Should we attempt to put something in there? Has anyone experimented with this and has a good recommendation here?

@cholmes cholmes modified the milestones: 1.0.0-beta.1, future Nov 7, 2022
@cholmes cholmes modified the milestones: future, 1.1 May 2, 2024
@cholmes (Member) commented May 29, 2024

Do we want to do something here for 1.1? We've now got 'spatial optimizations' ready to go for 1.1, so it seems like a good time to get this to completion too. It sounds like libraries do make use of the _metadata sidecar files; in the comment referenced above, Even said:

> [GDAL] uses the arrow-dataset library helpers. When it finds _metadata, it uses the arrow::dataset::ParquetDatasetFactory class, which avoids reading each fragment file for most operations (except for GetExtent(), since there's no way of having the global extent without iterating over each fragment's metadata). When it doesn't find _metadata, it uses the arrow::dataset::FileSystemDatasetFactory class (not sure if that one needs to open each file).

Do we want to explicitly say that the _metadata sidecar sets the extent of the whole dataset? Or just leave things as they are?

@kylebarron (Collaborator, Author)

> _metadata sidecar sets the extent of the whole dataset

I don't think there's an API (at least in pyarrow) to manually set the geoparquet metadata on the _metadata file, so I think the bbox in the file-level metadata of the _metadata file would be wrong. @jorisvandenbossche might know better.

@cholmes cholmes modified the milestones: 1.1, 1.2 Jun 3, 2024
@cholmes (Member) commented Jun 3, 2024

Discussed on the 6/3/24 call: this should be in best practices, and it is not dependent on the specification, so it was moved off the 1.1 release. We hope to do best practices soon, perhaps making a 'milestone' and pushing on the things we want there soon after 1.1, but not blocking the release on it.
