Reading parquet files with a list of groups #8606

Closed
tschaub opened this issue Oct 24, 2023 · 4 comments

tschaub commented Oct 24, 2023

Expected behavior and actual behavior.

I expected that a Parquet file with a logical LIST field whose elements are groups (or "structs") could be read.

I've attached an Archive.zip with three files:

  • input.geojson - A GeoJSON file with a groups property that is a list of objects.
  • expected.parquet - A Parquet file that encodes the groups as a LIST field where each element is a group.
  • actual.parquet - The output I get from ogr2ogr where the groups field is a STRING.

Expected Parquet schema:

message {
  optional binary geometry;
  optional group groups (LIST) {
    repeated group list {
      optional group element {
        optional double a;
        optional double b;
      }
    }
  }
}

Actual Parquet schema:

message {
  optional binary groups (STRING);
  optional binary geometry;
}
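
To make the difference between the two schemas concrete, here is a minimal sketch that computes the maximum definition and repetition levels of a leaf column from its path, using the usual Parquet rule that each optional or repeated node adds one definition level and each repeated node adds one repetition level. The `max_levels` helper and the path tuples are illustrative names, not part of GDAL or Arrow.

```python
def max_levels(path):
    """Return (max_definition_level, max_repetition_level) for a column path.

    `path` is a list of (name, repetition) pairs from the top-level field
    down to the leaf, where repetition is "required", "optional" or "repeated".
    """
    max_def = 0
    max_rep = 0
    for name, repetition in path:
        if repetition not in ("required", "optional", "repeated"):
            raise ValueError(f"unknown repetition for {name!r}: {repetition!r}")
        # Optional and repeated nodes can be absent, so each adds a definition level.
        if repetition in ("optional", "repeated"):
            max_def += 1
        # Only repeated nodes add a repetition level.
        if repetition == "repeated":
            max_rep += 1
    return max_def, max_rep


# Leaf "a" in the expected schema: groups (LIST) > list > element > a
expected_a = [
    ("groups", "optional"),
    ("list", "repeated"),
    ("element", "optional"),
    ("a", "optional"),
]

# The "groups" column in the actual schema is a flat optional string.
actual_groups = [("groups", "optional")]

print(max_levels(expected_a))     # (4, 1)
print(max_levels(actual_groups))  # (1, 0)
```

The nested column needs levels up to 4 and 1, while the flat string column only needs a definition level of 0 or 1.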

Steps to reproduce the problem.

ogr2ogr actual.parquet input.geojson

Operating system

macOS 13.3.1

GDAL version and provenance

GDAL 3.6.4, released 2023/04/17

Member

rouault commented Oct 25, 2023

What you want to accomplish would be doable in theory but would require significant coding effort in practice. In particular, it would require that the GeoJSON driver implement the ArrowStream interface directly (instead of relying on the generic implementation, as it currently does) AND that it have complicated logic to guess the ArrowSchema type from arbitrary JSON constructs made of nested lists and maps. The current behaviour is that as soon as the GeoJSON driver sees that a property is not a native OGR type, it ingests it as a JSON-serialized field, hence its String(JSON) typing.
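
As a rough illustration of the two behaviours described above, the following sketch contrasts "serialize anything non-native as JSON text" with a naive type guesser for nested lists and maps. This is not GDAL code: `is_native`, `as_field_value`, `infer_type`, the simplified notion of "native" (scalars and flat lists of same-typed scalars), and the sample property values are all assumptions made for the example.

```python
import json

SCALARS = (str, int, float, bool)


def is_native(value):
    """Simplified stand-in for 'maps to a native field type'."""
    if isinstance(value, SCALARS):
        return True
    if isinstance(value, list) and value:
        first = type(value[0])
        return first in SCALARS and all(type(v) is first for v in value)
    return False


def as_field_value(value):
    """Current behaviour, roughly: non-native values become JSON text."""
    if value is None or is_native(value):
        return value
    return json.dumps(value)


def infer_type(value):
    """Naive guess of a nested type; fails on empty or mixed-type lists."""
    if isinstance(value, bool):  # bool must be tested before int
        return "bool"
    if isinstance(value, int):
        return "int64"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, dict):
        fields = ", ".join(f"{k}: {infer_type(v)}" for k, v in value.items())
        return f"struct<{fields}>"
    if isinstance(value, list):
        element_types = {infer_type(v) for v in value}
        if len(element_types) != 1:
            raise ValueError("empty or mixed-type list: no single element type")
        return f"list<{element_types.pop()}>"
    raise ValueError(f"cannot infer a type for {value!r}")


# Hypothetical "groups" property, shaped like the one in the report.
groups = [{"a": 1.0, "b": 2.0}, {"a": 3.0, "b": 4.0}]

print(as_field_value(groups))  # a JSON string
print(infer_type(groups))      # list<struct<a: double, b: double>>
```

Even this toy guesser has to give up on empty lists, nulls, and heterogeneous elements, and it only looks at a single value, whereas a real implementation would have to reconcile types across every feature in the file.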

rouault added a commit to rouault/gdal that referenced this issue Oct 25, 2023
Author

tschaub commented Oct 25, 2023

@rouault - Thanks for the reply. I realize that I mixed two issues here: reading a Parquet file with a list of structs and writing a Parquet file with a list of structs. I understand that the GeoJSON driver serializes that struct list type as a string when writing the Parquet file. In terms of reading a list of structs from an existing Parquet file, I see that you are working toward support for that (f5e3bfd).

rouault added a commit to rouault/gdal that referenced this issue Oct 25, 2023
rouault added a commit that referenced this issue Oct 26, 2023
Arrow/Parquet: add support for reading list (or map) of struct (relates to #8606)
Member

rouault commented Oct 26, 2023

@tschaub Looking at planetlabs/gpq#102, it seems the issue is more about the reading side of the GDAL GeoParquet driver than about having a smart GeoJSON -> Parquet conversion. The issue with OvertureMap files like https://overturemaps-us-west-2.s3.amazonaws.com/release/2023-10-19-alpha.0/theme=buildings/type=building/part-00769-87dd7d19-acc8-4d4f-a5ba-20b407a79638.c000.zstd.parquet has been solved per #8608.

However, when trying to read the result of the following command:

$ ./gpq convert part-00769-87dd7d19-acc8-4d4f-a5ba-20b407a79638.c000.zstd.parquet test.geo.parquet --from="parquet" --to="geoparquet"

I get a "ReadNext() failed: Malformed levels. min: 2 max: 2 out of range. Max Level: 1" error. This error comes from the Arrow C++ library used by GDAL.
It can also be reproduced with the "parquet-reader" utility provided with Arrow C++:

$ ~/arrow/cpp/build/release/parquet-reader test.geo.parquet >/dev/null
Parquet error: Malformed levels. min: 2 max: 2 out of range.  Max Level: 1

So either gpq writes invalid Parquet, or it writes a flavor of Parquet not understood by the Parquet reader of Arrow C++.
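
For readers unfamiliar with the message, the sketch below shows the kind of range check that would presumably produce it: the definition or repetition levels decoded from a page must all lie between 0 and the maximum level implied by the column's schema. This is an illustration only, not the Arrow C++ implementation; `check_levels` is a hypothetical name, and the sample levels simply mirror the numbers in the error text.

```python
def check_levels(levels, max_level):
    """Raise if any decoded level falls outside [0, max_level]."""
    if not levels:
        return
    lowest, highest = min(levels), max(levels)
    if lowest < 0 or highest > max_level:
        raise ValueError(
            f"Malformed levels. min: {lowest} max: {highest} "
            f"out of range. Max Level: {max_level}"
        )


# Levels that fit the column's declared nesting pass silently.
check_levels([0, 1, 1, 0], max_level=1)

# Levels of 2 for a column whose schema allows at most 1 are rejected.
try:
    check_levels([2, 2], max_level=1)
except ValueError as err:
    print(err)
```

Under that reading, the writer encoded levels for deeper nesting than the reader derived from the file's schema for that column, which fits either explanation given above.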

Author

tschaub commented Oct 29, 2023

@rouault - apologies for mixing in the GeoJSON conversion issue - the primary issue I was responding to was about reading columns with a list of structs. And it looks like you've addressed that with #8608, so I'll close this issue.

And you are right, the remaining issue is either my misuse of the Go package, an issue with how the Go package writes Parquet, or an issue with how the C++ package reads Parquet. I've ticketed this as apache/arrow#38503. The Overture data sample is pretty unwieldy - I'd like to come up with a more minimal test case, but so far my efforts to filter the data make the problem go away.

tschaub closed this as completed Oct 29, 2023