Clarify support for zero geometry columns #165

Open
himikof opened this issue Jan 27, 2023 · 2 comments

Comments

@himikof

himikof commented Jan 27, 2023

Should the format allow for no geometry columns in a file?

I think that it should, because it is occasionally useful. For example, a tool converting schema-less (GeoJSON-like) input to GeoParquet has to either guess the geometry columns or error out when the input is empty. Neither option seems ideal.
Also, geopandas.GeoDataFrame().to_parquet(...) is a similar case and should do something reasonable and compliant.

As the spec is currently written, I think the answer is yes, but in a counter-intuitive way: columns can be empty, and the required primary_column field could be anything:

The name of the "primary" geometry column. In cases where a GeoParquet file contains multiple geometry columns, the primary geometry may be used by default in geospatial operations.

There is no requirement that primary_column actually be contained in columns, so the spec could be taken to mean "the name that would be used if there were any (multiple?) geometry columns".
Interestingly, the current implementation of geopandas.GeoDataFrame().to_parquet agrees, writing {"primary_column": "geometry", "columns": {}} in the metadata.

But the JSON schema contradicts the written spec here, requiring the columns to be non-empty, and there is an additional check for primary_column to be contained in columns.
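The disagreement between the two readings can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, only the two fields discussed here are checked (other required fields such as the version are ignored), and the "schema" reading simply encodes the two stricter checks described above rather than running the real JSON schema.

```python
def check_prose_reading(meta: dict) -> list:
    """Checks implied by the prose alone: both fields exist with the right types."""
    errors = []
    if not isinstance(meta.get("primary_column"), str):
        errors.append("primary_column must be a string")
    if not isinstance(meta.get("columns"), dict):
        errors.append("columns must be an object")
    return errors


def check_schema_reading(meta: dict) -> list:
    """Prose checks plus the two stricter checks described for the JSON schema."""
    errors = check_prose_reading(meta)
    if errors:
        return errors
    if not meta["columns"]:
        errors.append("columns must not be empty")
    if meta["primary_column"] not in meta["columns"]:
        errors.append("primary_column must be a key of columns")
    return errors


# The metadata reported above for an empty GeoDataFrame.
empty = {"primary_column": "geometry", "columns": {}}

print(check_prose_reading(empty))   # no complaints under the prose reading
print(check_schema_reading(empty))  # rejected under the stricter reading
```

Under these assumptions the same metadata is accepted by one reading and rejected by the other, which is the ambiguity this issue asks the spec to resolve.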

I think this corner case is important enough to be made explicit in the specification, either by making primary_column optional/nullable and empty columns valid, or by explicitly allowing {"primary_column": "geometry", "columns": {}}.
Or, alternatively, by explicitly prohibiting this case in the specification, even if I think that would be rather unfortunate.
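For the two permissive options, a writer-side sketch might look like the following. The helper `build_geo_metadata` and its `nullable_primary` flag are hypothetical, per-column details are left out as empty placeholders, and the fallback name "geometry" is taken from the geopandas behaviour mentioned above; none of this is mandated by the spec.

```python
import json


def build_geo_metadata(geometry_columns, *, nullable_primary=True,
                       version="1.0.0-dev"):
    """Build a "geo" metadata dict for the given geometry column names.

    With no geometry columns, either emit a null primary_column
    (first proposal) or a conventional placeholder name (second proposal).
    """
    # Per-column metadata is omitted here; real files need more than {}.
    columns = {name: {} for name in geometry_columns}
    if geometry_columns:
        primary = geometry_columns[0]
    elif nullable_primary:
        primary = None
    else:
        primary = "geometry"
    return {"version": version, "primary_column": primary, "columns": columns}


# Zero geometry columns under each proposal, serialized as JSON.
print(json.dumps(build_geo_metadata([])))
print(json.dumps(build_geo_metadata([], nullable_primary=False)))
```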

@jorisvandenbossche
Collaborator

What is the advantage of having the metadata without columns (in which case it is basically just a version, i.e. "geo": {"columns": [], "version": "1.0.0-dev"}), compared to just leaving out the "geo" metadata?

@himikof
Author

himikof commented Jan 28, 2023

I think that would depend on whether the "geo" metadata is made optional in the specification (that is, whether any other Parquet file would be a valid GeoParquet file, just without geometry columns). If so, then leaving out the "geo" metadata could indeed be another solution. But if the spec requires a valid GeoParquet file to always have this metadata (as it currently does), then it would be very strange for a GeoParquet serialization function of a GeoDataFrame (or something like it) to write a file that:

  1. is not valid according to the GeoParquet specification;
  2. would error out due to the missing required metadata in many (if not all) readers.

It seems like a trade-off between supporting missing "geo" metadata (and possibly interpreting a totally unrelated parquet file as GeoParquet) and empty "columns" (which likely requires a bit of careful handling) in GeoParquet readers.
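On the reader side, the trade-off could be sketched roughly as below. The function and exception names are hypothetical, and the input is assumed to be a plain dict of the file's key/value metadata in which the "geo" value is a JSON string; how that dict is obtained from a Parquet library is out of scope here.

```python
import json


class NotGeoParquet(ValueError):
    """Raised when a file carries no "geo" metadata and that is not tolerated."""


def geometry_column_names(key_value_metadata: dict, *, geo_optional=False) -> list:
    """Return the geometry column names declared in the "geo" metadata.

    geo_optional=False: missing "geo" is an error, but empty "columns" is fine.
    geo_optional=True:  missing "geo" is treated as zero geometry columns,
                        so any Parquet file is accepted.
    """
    raw = key_value_metadata.get("geo")
    if raw is None:
        if geo_optional:
            return []
        raise NotGeoParquet('missing "geo" metadata')
    meta = json.loads(raw)
    # Empty "columns" needs no special casing: it just yields an empty list.
    return list(meta.get("columns", {}))


empty_geo = {"geo": json.dumps({"primary_column": "geometry", "columns": {}})}
print(geometry_column_names(empty_geo))             # empty "columns" is accepted
print(geometry_column_names({}, geo_optional=True))  # unrelated file is accepted too
```

With `geo_optional=True` the reader can no longer tell a GeoParquet file without geometry from an unrelated Parquet file, whereas the strict mode keeps that distinction at the cost of requiring writers to emit empty "columns".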
