Issues using parquet-cli to read created files with repeated fields or LIST #151

Open
johanfunnel opened this issue Nov 13, 2024 · 0 comments

We're using parquet-cli (brew install parquet-cli) to read files that were created with this lib, but we're running into issues: either errors or empty values for fields with repeated: true and/or type: 'LIST'. Reading the same files using ParquetReader.openFile from this lib works fine, though!

Steps to reproduce

Example 1 - repeated: true

Using the following schema and code, based on this README example:

const schema = new ParquetSchema({
  id: { type: 'UTF8' },
  stock: {
    repeated: true,
    fields: {
      price: { type: 'DOUBLE' },
      quantity: { type: 'INT64' },
    },
  },
});

const writer = await ParquetWriter.openFile(
  schema,
  'repeated-example.parquet'
);

await writer.appendRow({
  id: 'Row1',
  stock: [
    { price: 100, quantity: 10 },
    { price: 200, quantity: 20 },
  ],
});

// Close the writer so the file footer is written
await writer.close();

Example 2 - type: 'LIST'

Using the following schema and code, based on the tests for array list:

const schema = new ParquetSchema({
  id: { type: 'UTF8' },
  test: {
    type: 'LIST',
    fields: {
      list: {
        repeated: true,
        fields: {
          element: {
            type: 'UTF8',
          },
        },
      },
    },
  },
});

const writer = await ParquetWriter.openFile(schema, 'list-example.parquet');

await writer.appendRow({
  id: 'Row1',
  test: { list: [{ element: 'abcdef' }, { element: 'fedcba' }] },
});

// Close the writer so the file footer is written
await writer.close();

1. Generate files using the examples above.
2. Read these files with parquet-cli using parquet cat <path-to-file>.

Expected behaviour

Example 1
parquet-cli reads the file without errors.

Example 2
The result has { list: [ { element: 'abcdef' }, { element: 'fedcba' } ] } in the test field, as it does when reading the file using ParquetReader.openFile.
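
For comparison, reading a file back with this lib can be sketched roughly as below. This is a minimal sketch, not the exact code we ran: the helper name readAllRows is hypothetical, and it only assumes the reader object returned by ParquetReader.openFile exposes getCursor() and close(), with cursor.next() resolving to null after the last row.

```javascript
// Hypothetical helper: collects every row from a parquetjs-style reader.
// `reader` is assumed to expose getCursor() and close(), like the object
// returned by ParquetReader.openFile.
async function readAllRows(reader) {
  const cursor = reader.getCursor();
  const rows = [];
  let row;
  // cursor.next() is assumed to resolve to null once all rows are consumed
  while ((row = await cursor.next())) {
    rows.push(row);
  }
  await reader.close();
  return rows;
}

// Usage (assumes ParquetReader has been imported from this lib):
//   const reader = await ParquetReader.openFile('list-example.parquet');
//   console.log(await readAllRows(reader));
```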

Actual behaviour

Example 1
An error is thrown; see Error logs below.

Example 2
The output is {"id": "Row1", "test": null}, so the test field is null instead of containing the list.

Error logs

From Example 1

Unknown error
java.lang.RuntimeException: Failed on record 0 in <omitted>/output-basic.parquet
	at org.apache.parquet.cli.commands.ScanCommand.run(ScanCommand.java:75)
	at org.apache.parquet.cli.Main.run(Main.java:163)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
	at org.apache.parquet.cli.Main.main(Main.java:191)
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:<omitted>/output-basic.parquet
	at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:280)
	at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136)
	at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:140)
	at org.apache.parquet.cli.BaseCommand$1$1.advance(BaseCommand.java:356)
	at org.apache.parquet.cli.BaseCommand$1$1.<init>(BaseCommand.java:337)
	at org.apache.parquet.cli.BaseCommand$1.iterator(BaseCommand.java:335)
	at org.apache.parquet.cli.commands.ScanCommand.run(ScanCommand.java:70)
	... 3 more
Caused by: org.apache.parquet.io.ParquetDecodingException: The requested schema is not compatible with the file schema. incompatible types: required group stock (LIST) {
  repeated group array {
    required double price;
    required int64 quantity;
  }
} != repeated group stock {
  required double price;
  required int64 quantity;
}
	at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.incompatibleSchema(ColumnIOFactory.java:104)
	at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visitChildren(ColumnIOFactory.java:81)
	at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:57)
	at org.apache.parquet.schema.MessageType.accept(MessageType.java:52)
	at org.apache.parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:167)
	at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:155)
	at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:245)
	... 9 more
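
The last "Caused by" above shows the reader requesting a LIST-annotated wrapper group for stock, while the file contains a bare repeated group. As a rough diagnostic, the sketch below walks a plain schema definition object (the shape passed to ParquetSchema in the examples) and lists fields declared with repeated: true whose parent is not type: 'LIST'. The helper name findBareRepeated is hypothetical and not part of this lib, and the assumption that such fields are the ones triggering the mismatch is based only on this error message.

```javascript
// Hypothetical helper: returns dotted paths of fields that are declared
// with `repeated: true` without being nested directly under a LIST group.
function findBareRepeated(fields, parentIsList = false, prefix = '') {
  const found = [];
  for (const [name, def] of Object.entries(fields)) {
    const path = prefix ? `${prefix}.${name}` : name;
    if (def.repeated && !parentIsList) {
      found.push(path);
    }
    if (def.fields) {
      // Children are checked knowing whether this group is a LIST
      found.push(...findBareRepeated(def.fields, def.type === 'LIST', path));
    }
  }
  return found;
}

// Schema definition from Example 1
const repeatedExample = {
  id: { type: 'UTF8' },
  stock: {
    repeated: true,
    fields: { price: { type: 'DOUBLE' }, quantity: { type: 'INT64' } },
  },
};

// Schema definition from Example 2
const listExample = {
  id: { type: 'UTF8' },
  test: {
    type: 'LIST',
    fields: {
      list: { repeated: true, fields: { element: { type: 'UTF8' } } },
    },
  },
};

console.log(findBareRepeated(repeatedExample)); // [ 'stock' ]
console.log(findBareRepeated(listExample)); // []
```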