Difficulty reading parquet files externally #2226

Closed
joosthooz opened this issue Mar 17, 2023 · 3 comments

@joosthooz

Describe the bug
After Tempo compacts some data into a data.parquet file, I'd like to view this data with an external tool, but I have not been able to do so. I've tried parquet-tools, bdt, pandas (with both parquet engines), pyarrow (10.0.1 and a fairly recent build), and even parquet-mr. Could you recommend a tool for inspecting the contents of these files?

For example, this is the output from parquet-mr:

java -cp 'target/parquet-cli-1.13.0-SNAPSHOT.jar:target/dependency/*' org.apache.parquet.cli.Main head ../../data.parquet
Unknown error
java.lang.UnsupportedOperationException: REPEATED not supported outside LIST or MAP. Type: repeated group Attrs {
  required binary Key (STRING);
  optional binary Value (STRING);
  optional int64 ValueInt (INTEGER(64,true));
  optional double ValueDouble;
  optional boolean ValueBool;
  optional binary ValueKVList (STRING);
  optional binary ValueArray (STRING);
}
	at org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:292)
	at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:440)
	at org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:290)
	at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:440)
	at org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:290)
	at org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:279)
	at org.apache.parquet.cli.util.Schemas.fromParquet(Schemas.java:89)
	at org.apache.parquet.cli.BaseCommand.getAvroSchema(BaseCommand.java:405)
	at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:66)
	at org.apache.parquet.cli.Main.run(Main.java:163)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
	at org.apache.parquet.cli.Main.main(Main.java:193)
@joe-elliott
Member

We have seen similar issues where various Parquet tooling doesn't support the same features that parquet-go does. We are working on a vparquet2 version that will attempt to fix some of these.

I believe the error shown above is being addressed, but I will let the engineer working on it (@stoewer) comment.

@stoewer
Contributor

stoewer commented Mar 19, 2023

Tempo vParquet uses a schema that contains repeated groups that are not structured as properly nested LIST types. This causes the above error with the parquet-mr/parquet-cli tool.
As mentioned by @joe-elliott, this issue will be addressed in vParquet2.
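
To illustrate the rule behind the error message ("REPEATED not supported outside LIST or MAP"), here is a minimal sketch in plain Python. It is not parquet-mr's code and does not read a real file: `Node`, `bare_repeated_fields`, and the parent group name `Span` are hypothetical names made up for this example, and only the `Attrs`, `Key`, and `Value` field names come from the trace above. The second schema follows the three-level LIST layout described in the Parquet format specification; whether vParquet2 uses exactly this shape is not stated in this thread.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical, simplified stand-in for one field of a Parquet schema."""
    name: str
    repetition: str                    # "required", "optional" or "repeated"
    annotation: Optional[str] = None   # e.g. "LIST", "MAP", "STRING"
    children: List["Node"] = field(default_factory=list)


def bare_repeated_fields(node, parent_annotation=None, path=""):
    """Return dotted paths of repeated fields whose parent is not LIST or MAP."""
    here = f"{path}.{node.name}" if path else node.name
    found = []
    if node.repetition == "repeated" and parent_annotation not in ("LIST", "MAP"):
        found.append(here)
    for child in node.children:
        found.extend(bare_repeated_fields(child, node.annotation, here))
    return found


def attr_fields():
    """Two of the Attrs columns from the error message, as fresh nodes."""
    return [
        Node("Key", "required", "STRING"),
        Node("Value", "optional", "STRING"),
    ]


# Shape reported in the error: a repeated group placed directly in its parent.
bare_schema = Node("Span", "required", children=[
    Node("Attrs", "repeated", children=attr_fields()),
])

# Three-level LIST layout: annotated group -> repeated "list" -> "element".
list_schema = Node("Span", "required", children=[
    Node("Attrs", "optional", "LIST", children=[
        Node("list", "repeated", children=[
            Node("element", "required", children=attr_fields()),
        ]),
    ]),
])

print(bare_repeated_fields(bare_schema))  # ['Span.Attrs']
print(bare_repeated_fields(list_schema))  # []
```

In the first schema the repeated `Attrs` group is flagged because nothing marks it as a list. In the second, the only repeated field sits directly under a LIST-annotated group, so nothing is flagged.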

In the meantime you could try stoewer/parquet-cli to gain insights about Tempo blocks (but keep in mind that this is an experimental tool and not meant to be used in production).

@joosthooz
Author

Thank you for pointing me to the parquet-cli project. I could not build it, so I opened an issue on that repo. However, now that I've found a way to log traceIDs, I have less need to inspect the data itself, so I'll close this issue.
