
How should metadata be written in a partitioned dataset? #79

Open
kylebarron opened this issue Apr 22, 2022 · 5 comments

@kylebarron (Collaborator)

So far the spec has only covered single-file Parquet data. However, Parquet also supports saving as a "dataset", where there are several Parquet files in a folder structure. In this case, how should geospatial metadata be stored? There's a Parquet best practice of writing _common_metadata and _metadata sidecar files to the root of the folder structure, but that's not part of the actual Parquet specification.

If I understand correctly, the geo metadata would automatically be included in the _common_metadata file, and statistics would additionally be stored in the _metadata file, which is relevant for #13.

Should this be part of the geoparquet spec? Should it be a "best practice" that we document?

@cholmes cholmes added this to the 0.4 milestone Apr 22, 2022
@kylebarron (Collaborator, Author)

Adding here for discussion from #101 (comment); we should clarify what the bounding box represents.

Is it possible for each file's metadata to contain only its own bounding box, while the _metadata file contains the bounding box of the entire dataset? Or should every file carry the same bounding box, representing the entire dataset?
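One way to reconcile the two readings: each fragment records only its own bbox, and the writer unions them into a dataset-level bbox for the sidecar. A sketch of the union step, with illustrative per-file geo payloads and bbox values (not from any real dataset):

```python
def union_bbox(bboxes):
    """Union a list of [xmin, ymin, xmax, ymax] boxes into one box."""
    xmins, ymins, xmaxs, ymaxs = zip(*bboxes)
    return [min(xmins), min(ymins), max(xmaxs), max(ymaxs)]

# Per-file bboxes as they might appear in each fragment's "geo" footer
# metadata (hypothetical values for illustration).
per_file_geo = [
    {"columns": {"geometry": {"bbox": [0.0, 0.0, 5.0, 5.0]}}},
    {"columns": {"geometry": {"bbox": [4.0, -2.0, 9.0, 3.0]}}},
]

dataset_bbox = union_bbox(
    [g["columns"]["geometry"]["bbox"] for g in per_file_geo]
)
print(dataset_bbox)  # [0.0, -2.0, 9.0, 5.0]
```

In real use the per-file payloads would come from each fragment's footer (e.g. pyarrow's FileMetaData.metadata), not from an in-memory list.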

@cholmes (Member) commented Oct 24, 2022

Where do we stand on this in relation to the latest discussion pushing for 1.0.0-beta.1 sooner rather than later? Should we attempt to put something in there? Has anyone experimented with this and has a good recommendation here?

@cholmes cholmes modified the milestones: 1.0.0-beta.1, future Nov 7, 2022
@cholmes cholmes modified the milestones: future, 1.1 May 2, 2024
@cholmes (Member) commented May 29, 2024

Do we want to do something here for 1.1? We've now got 'spatial optimizations' ready to go for 1.1, so it seems like a good time to get this to completion too. It sounds like libraries do make use of the _metadata sidecar files; in the comment referenced above, Even said:

> [GDAL] uses the arrow-dataset library helpers. When it finds _metadata, it uses the arrow::dataset::ParquetDatasetFactory class, which avoids reading each fragment file for most operations (except for GetExtent(), since there's no way of having the global extent without iterating over each fragment's metadata). When it doesn't find _metadata, it uses the arrow::dataset::FileSystemDatasetFactory class (not sure if that one needs to open each file).

Do we want to explicitly say that the _metadata sidecar sets the extent of the whole dataset? Or just leave things as they are?

@kylebarron (Collaborator, Author)

> _metadata sidecar sets the extent of the whole dataset

I don't think there's an API (at least in pyarrow) to manually set the geoparquet metadata on the _metadata file, so I think the bbox in the file-level metadata of the _metadata file would be wrong. @jorisvandenbossche might know better.

@cholmes cholmes modified the milestones: 1.1, 1.2 Jun 3, 2024
@cholmes (Member) commented Jun 3, 2024

Discussed on the 6/3/24 call: this should be in best practices, and it is not dependent on the specification, so it was moved off the 1.1 release. We hope to do best practices soon, perhaps making a 'milestone' and pushing on the things we want there soon after 1.1, but not blocking the release on it.
