Reading parquet files with a list of groups #8606
Comments
What you want to accomplish would be doable in theory but would require significant coding effort in practice. In particular, it would require that the GeoJSON driver implement the ArrowStream interface directly (instead of relying on the generic implementation, as it currently does) AND that it have complicated logic to guess the ArrowSchema type from arbitrary JSON constructs made of nested lists and maps. The current behaviour is that as soon as the GeoJSON driver sees that a property is not a native OGR type, it ingests it as a JSON-serialized field, hence its String(JSON) typing.
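To make the fallback described above concrete, here is a minimal sketch in plain Python. It is not GDAL's code: the function name `guess_field`, the type labels other than String(JSON), and the sample properties are all hypothetical. It deliberately treats only scalars as native and ignores anything else the real driver may handle natively.

```python
import json

# Hypothetical, simplified model of the behaviour described in the comment
# above. This is NOT GDAL's implementation; the labels are illustrative only.
SCALAR_LABELS = {bool: "Boolean", int: "Integer", float: "Real", str: "String"}


def guess_field(value):
    """Return (type_label, stored_value) for one GeoJSON property value."""
    label = SCALAR_LABELS.get(type(value))
    if label is not None:
        # Scalar values keep their type and are stored unchanged.
        return label, value
    # Anything else (objects, lists of objects, ...) is kept as JSON text,
    # which is why a list of structs ends up typed as a string.
    return "String(JSON)", json.dumps(value)


# Made-up properties resembling a feature with a "groups" list of objects.
feature_properties = {
    "name": "example",
    "groups": [{"id": 1, "label": "a"}, {"id": 2, "label": "b"}],
}

for key, value in feature_properties.items():
    print(key, guess_field(value))
```

Producing a real list-of-struct column would mean replacing the final `json.dumps` branch with logic that infers a nested schema from arbitrary JSON, which is the part described above as requiring significant effort.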
@rouault - Thanks for the reply. I realize that I mixed two issues here: reading a Parquet file with a list of structs and writing a Parquet file with a list of structs. I understand that the GeoJSON driver serializes that struct list type as a string when writing the Parquet file. In terms of reading a list of structs from an existing Parquet file, I see that you are working toward support for that (f5e3bfd).
Arrow/Parquet: add support for reading list (or map) of struct (relates to #8606)
@tschaub Looking at planetlabs/gpq#102, it seems the issue is more about the reading side of the GDAL GeoParquet driver than about having a smart GeoJSON -> Parquet. The issue with OvertureMap files like https://overturemaps-us-west-2.s3.amazonaws.com/release/2023-10-19-alpha.0/theme=buildings/type=building/part-00769-87dd7d19-acc8-4d4f-a5ba-20b407a79638.c000.zstd.parquet" However, when trying to read the result of ./gpq convert part-00769-87dd7d19-acc8-4d4f-a5ba-20b407a79638.c000.zstd.parquet test.geo.parquet --from="parquet" --to="geoparquet", I get a "ReadNext() failed: Malformed levels. min: 2 max: 2 out of range. Max Level: 1" error. This error comes from the Arrow C++ library used by GDAL.
So either gpq writes invalid Parquet, or it writes a flavor of Parquet that the Parquet reader of Arrow C++ does not understand.
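Read literally, the error message says that the levels found in the file (minimum 2, maximum 2) exceed the maximum level the reader expects for that column (1). The sketch below only illustrates that kind of range check. The function name `check_levels` and its inputs are hypothetical, and it is not the Arrow C++ code.

```python
def check_levels(levels, max_level):
    """Raise ValueError if any level falls outside the range [0, max_level].

    Hypothetical illustration of a range check that would produce a message
    like the one quoted above; not taken from Arrow C++.
    """
    if not levels:
        return
    lo, hi = min(levels), max(levels)
    if lo < 0 or hi > max_level:
        raise ValueError(
            f"Malformed levels. min: {lo} max: {hi} out of range. "
            f"Max Level: {max_level}"
        )


# Levels within the expected range pass silently.
check_levels([0, 1, 1], max_level=1)

# Levels above the expected maximum are rejected, mirroring the report.
try:
    check_levels([2, 2], max_level=1)
except ValueError as exc:
    print(exc)
```

Under this reading, the writer and the reader disagree about how deeply the column is nested, which fits both explanations offered above.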
@rouault - apologies for mixing in the GeoJSON conversion issue - the primary issue I was responding to was about reading columns with a list of structs. And it looks like you've addressed that with #8608, so I'll close this issue. And you are right, the remaining issue is either my misuse of the Go package, an issue with how the Go package writes Parquet, or an issue with how the C++ package reads Parquet. I've ticketed this as apache/arrow#38503. The Overture data sample is pretty unwieldy - I'd like to come up with a more minimal test case, but so far my efforts to filter the data make the problem go away.
Expected behavior and actual behavior.
I expected that a Parquet file with a logical LIST field where the list elements are group (or "struct") could be read.
I've attached an Archive.zip with three files:
- input.geojson - A GeoJSON file with a groups property that is a list of objects.
- expected.parquet - A Parquet file that encodes the groups as a LIST field where each element is a group.
- actual.parquet - The output I get from ogr2ogr where the groups field is a STRING.
Expected Parquet schema:
Actual Parquet schema:
Steps to reproduce the problem.
Operating system
macOS 13.3.1
GDAL version and provenance
GDAL 3.6.4, released 2023/04/17