
GH-43994: [C++][Parquet] Fix schema conversion from two-level encoding nested list #43995

Open · wants to merge 6 commits into base: main
Conversation

@wgtmac (Member) commented Sep 6, 2024

Rationale for this change

The current C++ Parquet implementation interprets the following Parquet schema as `array<struct<array: array>>`, which is wrong:

  optional group a (LIST) {
    repeated group array (LIST) {
      repeated int32 array;
    }
  }

What changes are included in this PR?

According to the parquet spec, the above schema should be inferred as array<array<int>>.

Are these changes tested?

Yes, a test case has been added to verify the fix.

Are there any user-facing changes?

No.

@wgtmac (Member, Author) commented Sep 6, 2024

@emkornfield @pitrou @mapleFU Would you mind taking a look? Thanks!

@mapleFU (Member) commented Sep 6, 2024

https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists

Without legacy:

The element field encodes the list's element type and repetition. Element repetition must be required or optional.

With backward capability:

Some existing data does not include the inner element layer. For backward-compatibility, the type of elements in LIST-annotated structures should always be determined by the following rules:

  1. If the repeated field is not a group, then its type is the element type and elements are required.
  2. If the repeated field is a group with multiple fields, then its type is the element type and elements are required.
  3. If the repeated field is a group with one field and is named either array or uses the LIST-annotated group's name with _tuple appended then the repeated type is the element type and elements are required.
  4. Otherwise, the repeated field's type is the element type with the repeated field's repetition.

So, it seems this hits rule (1)?
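The four backward-compatibility rules quoted above can be sketched as a classifier over a simplified node model. This is illustrative only: the `Node` struct and `ApplicableListRule` function are hypothetical and do not correspond to the actual `parquet::schema` API.

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical, simplified schema node for illustration only.
struct Node {
  std::string name;
  bool is_group = false;
  std::vector<std::shared_ptr<Node>> fields;
};

// Returns which backward-compatibility rule from LogicalTypes.md applies
// to the repeated field directly under a LIST-annotated group.
int ApplicableListRule(const Node& repeated, const std::string& list_name) {
  if (!repeated.is_group) return 1;           // primitive: it is the element
  if (repeated.fields.size() > 1) return 2;   // multi-field group is the element
  if (repeated.name == "array" ||
      repeated.name == list_name + "_tuple")
    return 3;                                 // legacy names: group is the element
  return 4;                                   // group is the new element layer
}
```

Under this naive reading, the inner `repeated int32 array` hits rule (1), while the outer `repeated group array (LIST)` hits rule (3) because it is a single-field group named `array`; that rule (3) match is the behavior this PR questions.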

@mapleFU (Member) left a review Sep 6, 2024.
@wgtmac (Member, Author) commented Sep 8, 2024

  optional group a (LIST) {
    repeated group array (LIST) {
      repeated int32 array;
    }
  }

IMO, the root cause is that the current code recognizes the schema above as three-level encoding. However, the inner-most field can only be required or optional in three-level encoding, yet here the int32 field is repeated. We can decouple the nested field into two lists as below:

  outer_list:
  optional group a (LIST) {
    repeated group array (LIST) {}
  }

  inner_list:
  repeated group array (LIST) {
    repeated int32 array;
  }

It is obvious that inner_list can simply apply backward-compatibility rule (1). For the outer_list, the current code applies rule (3). I think we need to apply rule (4) here by modifying the rule (3) to below:

If the repeated field is a group with one required or optional field and is named either array or uses the LIST-annotated group's name with _tuple appended then the repeated type is the element type and elements are required.
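The amended rule (3) proposed above can be sketched as follows. The `Node` model and `AppliesRuleThree` helper are hypothetical, not the Arrow C++ API; the point is that rule (3) only fires when the single child is itself required or optional, so a repeated child falls through to rule (4).

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical node model for illustration only.
struct Node {
  std::string name;
  bool is_group = false;
  bool is_repeated = false;  // repetition of this field
  std::vector<std::shared_ptr<Node>> fields;
};

// Amended rule (3): a single-field group with a legacy name is the element
// type only if that single field is required or optional (not repeated).
bool AppliesRuleThree(const Node& rep, const std::string& list_name) {
  bool legacy_name =
      rep.name == "array" || rep.name == list_name + "_tuple";
  return rep.is_group && rep.fields.size() == 1 && legacy_name &&
         !rep.fields[0]->is_repeated;
}
```

With this check, `repeated group array (LIST) { repeated int32 array; }` no longer matches rule (3), while the spec's `List<OneTuple<String>>` example (single required child) still does.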

@mapleFU (Member) commented Sep 8, 2024

Yes. It's so tricky; I think we could just copy the Java code directly, lol

@mapleFU (Member) commented Sep 8, 2024

I think we are just missing a check on this line.

The fix itself LGTM, but I wonder whether we should test and align more...

@pitrou (Member) commented Sep 10, 2024

The current C++ parquet implementation interprets following parquet schema as `array<struct<array:array>>`, which is wrong:

What is "array"? Do you mean "list"? Can you fix the PR description?

According to the parquet spec, the above schema should be inferred as array<array<int>>.

Where is this in the Parquet spec? I cannot find a similar example.

I have seen an issue when reading a Parquet file created by Hudi.

  1. Can we check with the Parquet ML whether this is really a legitimate schema structure?
  2. If so, can we add a testing file in parquet-testing?

@mapleFU (Member) commented Sep 10, 2024

Where is this in the Parquet spec? I cannot find a similar example.

The wording of the spec is very ambiguous:

If the repeated field is not a group, then its type is the element type and elements are required.
If the repeated field is a group with multiple fields, then its type is the element type and elements are required.
If the repeated field is a group with one field and is named either array or uses the LIST-annotated group's name with _tuple appended then the repeated type is the element type and elements are required.
Otherwise, the repeated field's type is the element type with the repeated field's repetition.

I think this just follows rule (4): the repeated field's type is the element type with the repeated field's repetition.

Can we check with the Parquet ML whether this is really a legitimate schema structure?
If so, can we add a testing file in parquet-testing?

I think a test file would be better.

@wgtmac (Member, Author) commented Sep 11, 2024

I'm using Hive schema notation, which is why it is `array<array<int>>`. The file can easily be produced by Spark SQL like below:

package org.example

import org.apache.spark.sql.SparkSession

object ParquetTwoLevelList {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[1]")
      .appName("NestedListTest")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.hudi.catalog.HoodieCatalog")
      .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
      .config("spark.kryo.registrator", "org.apache.spark.HoodieSparkKryoRegistrar")
      .getOrCreate()
    spark.sql("CREATE TABLE nested_list_test (a array<array<int>>) USING HUDI")
    spark.sql("INSERT INTO nested_list_test VALUES ( array(array(1,2), array(3,4)) )")
  }

}

The parquet-cli prints the following metadata:

File path:  /Users/gangwu/Projects/hudi-spark-generator/spark-warehouse/nested_list_test/f92ed4b5-c063-4b94-90a4-5ef997db1a6c-0_0-13-12_20240911093900996.parquet
Created by: parquet-mr version 1.12.3 (build f8dced182c4c1fbdec6ccb3185537b5a01e6ed6b)
Properties:
  hoodie_bloom_filter_type_code: DYNAMIC_V0
  org.apache.hudi.bloomfilter: ***
  hoodie_min_record_key: 20240911093900996_0_0
  parquet.avro.schema: {"type":"record","name":"nested_list_test_record","namespace":"hoodie.nested_list_test","fields":[{"name":"_hoodie_commit_time","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_commit_seqno","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_record_key","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_partition_path","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_file_name","type":["null","string"],"doc":"","default":null},{"name":"a","type":["null",{"type":"array","items":["null",{"type":"array","items":["null","int"]}]}],"default":null}]}
  writer.model.name: avro
  hoodie_max_record_key: 20240911093900996_0_0
Schema:
message hoodie.nested_list_test.nested_list_test_record {
  optional binary _hoodie_commit_time (STRING);
  optional binary _hoodie_commit_seqno (STRING);
  optional binary _hoodie_record_key (STRING);
  optional binary _hoodie_partition_path (STRING);
  optional binary _hoodie_file_name (STRING);
  optional group a (LIST) {
    repeated group array (LIST) {
      repeated int32 array;
    }
  }
}


Row group 0:  count: 1  441.00 B records  start: 4  total(compressed): 441 B total(uncompressed):349 B
--------------------------------------------------------------------------------
                        type      encodings count     avg size   nulls   min / max
_hoodie_commit_time     BINARY    G   _     1         68.00 B    0       "20240911093900996" / "20240911093900996"
_hoodie_commit_seqno    BINARY    G   _     1         72.00 B    0       "20240911093900996_0_0" / "20240911093900996_0_0"
_hoodie_record_key      BINARY    G   _     1         72.00 B    0       "20240911093900996_0_0" / "20240911093900996_0_0"
_hoodie_partition_path  BINARY    G   _     1         50.00 B    0       "" / ""
_hoodie_file_name       BINARY    G   _     1         116.00 B   0       "f92ed4b5-c063-4b94-90a4-5..." / "f92ed4b5-c063-4b94-90a4-5..."
a.array.array           INT32     G   _     4         15.75 B    0       "1" / "4"

-------------

@mapleFU (Member) commented Sep 11, 2024

@wgtmac Would you mind checking for a test file in parquet-testing and adding one if it does not exist?

@wgtmac (Member, Author) commented Sep 13, 2024

I will try to use parquet-java to create a minimal file and add it to parquet-testing. The file created by Hudi is too large due to a file-level bloom filter embedded in the file footer.

@wgtmac (Member, Author) commented Oct 17, 2024

Gentle ping :) @emkornfield @pitrou @mapleFU

@mapleFU (Member) left a review comment:
LGTM

@wgtmac wgtmac requested a review from pitrou October 24, 2024 14:06
@@ -681,6 +681,10 @@ Status ListToSchemaField(const GroupNode& group, LevelInfo current_levels,
// List of primitive type
RETURN_NOT_OK(
NodeToSchemaField(*list_group.field(0), current_levels, ctx, out, child_field));
} else if (list_group.field_count() == 1 && list_group.field(0)->is_repeated()) {
Review comment from a Contributor:
BTW, it looks like HasStructListName is not correct for `_tuple`, as it only checks that the name ends in `_tuple`, not that it is the top-level list's name with `_tuple` appended.

Review comment from a Contributor on the same diff hunk:
I'm not sure this is correct, or at least the comments above need to be updated to explain the logic further.

Specifically, from the examples in LogicalTypes.md:

// List<OneTuple<String>> (nullable list, non-null elements)
optional group my_list (LIST) {
  repeated group array {
    required binary str (STRING);
  };
}

This seems to imply that, despite how the file was written with the Avro bindings, there should in fact be an intermediate struct and not a `list<list<>>`. It's not clear to me if this is a bug in the spec or a bug in the Avro Java writer implementation.

@wgtmac (Member, Author) commented Oct 29, 2024
Rule (3) of the backward-compatibility rules is: "If the repeated field is a group with one field and is named either array or uses the LIST-annotated group's name with _tuple appended then the repeated type is the element type and elements are required." It says that the repeated type is the element type.

optional group my_list (LIST) {
  repeated group array {
    required binary str (STRING);
  };
}

So for the schema you mentioned above, the element type is `group array { required binary str (STRING); }`, which resolves exactly to OneTuple<String>.

optional group a (LIST) {
  repeated group array (LIST) {
    repeated int32 array;
  }
}

However, for the schema I mentioned in this issue, the element type is `group array (LIST) { repeated int32 array; }`, whose inner element resolves to List<int32> according to rule (1): "If the repeated field is not a group, then its type is the element type and elements are required."

The parquet-java implementation interprets this case in the same way: https://github.com/apache/parquet-java/blob/42cf31c0fbe4f000d4ddb1e1092c6634989ea3ca/parquet-avro/src/main/java/org/apache/parquet/avro/AvroSchemaConverter.java#L588

@wgtmac (Member, Author) commented:
I have opened apache/parquet-format#466 to clarify things.

@emkornfield (Contributor) left a review comment:
Commented; it's not clear that this is correct, and the bug might be with the Avro writer if I am reading the spec correctly.

::std::string_view name{node.name()};
return name == "array" || EndsWith(name, "_tuple");
return name == "array" || name == (parent.name() + "_tuple");
@wgtmac (Member, Author) commented:
@emkornfield Fixed the matching of `_tuple` to follow the spec.
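The stricter name check in the diff above can be sketched as a standalone function. This is an illustrative reconstruction, not the exact Arrow code: the legacy element name must be exactly `array`, or exactly the parent LIST group's name with `_tuple` appended, rather than any name that merely ends in `_tuple`.

```cpp
#include <string>
#include <string_view>

// Sketch of the corrected check: only "array" or "<parent>_tuple"
// mark the legacy single-field struct-list encoding.
bool HasStructListName(std::string_view name, std::string_view parent_name) {
  return name == "array" || name == std::string(parent_name) + "_tuple";
}
```

With this version, a field named `other_tuple` inside a list group named `my_list` no longer matches, whereas the old `EndsWith(name, "_tuple")` check would have accepted it.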

RETURN_NOT_OK(GroupToStruct(list_group, current_levels, ctx, out, child_field));
} else if (list_group.field_count() == 1) {
const auto& repeated_field = list_group.field(0);
if (repeated_field->is_repeated()) {
@emkornfield (Contributor) commented Oct 30, 2024
Thanks for the careful explanation of my last questions. After rereading, I mostly agree that this is a bug that needs to be fixed; I think I missed the second (LIST) annotation. Even though this check corresponds to the Java code, it seems the important factor here is that the list_group is in fact a LIST (and possibly even a MAP), not that the inner element is repeated?

So I think the logic might make more sense as (pseudocode):

if (list_group.field_count() > 1) {
  ...
} else if (HasListElementName(list_group, group)) {
  if (IsMap(list_group)) {
    RETURN_NOT_OK(
        ListToMapField(*list_group, current_levels, ctx, out, child_field));
  } else if (IsList(list_group)) {
    RETURN_NOT_OK(
        ListToSchemaField(*list_group, current_levels, ctx, out, child_field));
  } else {
    RETURN_NOT_OK(GroupToStruct(list_group, current_levels, ctx, out, child_field));
  }
} else {
  ...
}

Does this formulation work?

@wgtmac (Member, Author) commented Oct 31, 2024
No, this does not work. It has two issues:

@@ -727,6 +780,60 @@ TEST_F(TestConvertParquetSchema, ParquetRepeatedNestedSchema) {
ASSERT_NO_FATAL_FAILURE(CheckFlatSchema(arrow_schema));
}

TEST_F(TestConvertParquetSchema, IllegalParquetNestedSchema) {
@wgtmac (Member, Author) commented:
This case verifies that a three-level list or map cannot be nested in a legacy two-level list.

// The Parquet spec requires that LIST-annotated group cannot be repeated when
// it applies normal three-level encoding. We need to figure out legacy list
// structures and do not enforce this rule for them.
bool is_legacy_list_structure = true;
@wgtmac (Member, Author) commented:
Now I have changed the repetition check to apply only to non-legacy lists.
