Arrow, Spark 3.4: Support vectorized reads with struct constants #8466

aokolnychyi · 2023-09-01T22:26:21Z

Our merge-on-read queries can't benefit from vectorized reads because the _partition metadata column, which is a struct, is projected for the write distribution. This PR adapts our Arrow and Spark 3.4 logic to support such struct constants.

aokolnychyi · 2023-09-01T22:29:44Z

arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java

@@ -142,6 +152,12 @@ public ConstantVectorHolder(int numRows, T constantValue) {
      this.constantValue = constantValue;
    }

+    public ConstantVectorHolder(Types.NestedField icebergField, int numRows, T constantValue) {
+      super(icebergField);


Each VectorHolder actually has icebergField but we were not setting it for constants.

aokolnychyi · 2023-09-01T22:30:27Z

arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java

@@ -119,6 +123,10 @@ private enum ReadType {
    DICTIONARY
  }

+  protected Types.NestedField icebergField() {


Exposing it to all readers so that we can construct typed constant vectors later.

aokolnychyi · 2023-09-01T22:31:34Z

...k/v3.4/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/SparkPlanUtil.java

+import org.apache.spark.sql.execution.datasources.v2.BatchScanExec;
+import scala.collection.Seq;
+
+public class SparkPlanUtil {


This is located in tests and uses AdaptiveSparkPlanHelper from Spark. Otherwise, I would have to write a lot of ugly Java code to work with Scala SparkPlan (e.g. unwrap AQE).

aokolnychyi · 2023-09-01T22:31:56Z

...sions/src/test/java/org/apache/iceberg/spark/extensions/SparkRowLevelOperationsTestBase.java

@@ -172,7 +177,9 @@ protected void initTable() {
            tableName, PARQUET_VECTORIZATION_ENABLED, vectorized);
        break;
      case "orc":
-        Assert.assertTrue(vectorized);
+        sql(


This was not set correctly before, probably from earlier days.

aokolnychyi · 2023-09-01T22:33:00Z

...k/v3.4/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnVectorBuilder.java

@@ -38,8 +39,10 @@ public ColumnVector build(VectorHolder holder, int numRows) {
      if (holder instanceof VectorHolder.DeletedVectorHolder) {
        return new DeletedColumnVector(Types.BooleanType.get(), isDeleted);
      } else if (holder instanceof ConstantVectorHolder) {
-        return new ConstantColumnVector(
-            Types.IntegerType.get(), numRows, ((ConstantVectorHolder<?>) holder).getConstant());


This was the primary problem: we always assumed metadata columns were integers.
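A minimal, dependency-free sketch of the problem described above, assuming a simplified model of the builder. Every name in it (TypedConstantSketch, FieldType, ConstantHolder, ConstantVector, buildAssumingInteger, buildTyped) is a hypothetical stand-in and not an Iceberg or Spark class. It only illustrates why labeling every constant as an integer misreports a struct constant such as _partition, and how taking the type from the holder's field avoids that.

```java
import java.util.List;

// Hypothetical stand-ins for illustration only; these are NOT Iceberg or Spark classes.
final class TypedConstantSketch {
  enum FieldType { INTEGER, STRUCT }

  // A constant holder that carries the type of the field it belongs to.
  static final class ConstantHolder {
    final FieldType fieldType;
    final Object constant;

    ConstantHolder(FieldType fieldType, Object constant) {
      this.fieldType = fieldType;
      this.constant = constant;
    }
  }

  // A constant vector that reports a type alongside its single value.
  static final class ConstantVector {
    final FieldType type;
    final Object constant;

    ConstantVector(FieldType type, Object constant) {
      this.type = type;
      this.constant = constant;
    }
  }

  // Old behavior in this sketch: every constant is labeled as an integer.
  static ConstantVector buildAssumingInteger(ConstantHolder holder) {
    return new ConstantVector(FieldType.INTEGER, holder.constant);
  }

  // New behavior in this sketch: the type comes from the holder's field.
  static ConstantVector buildTyped(ConstantHolder holder) {
    return new ConstantVector(holder.fieldType, holder.constant);
  }

  public static void main(String[] args) {
    // A struct-valued constant, loosely modeled on a partition tuple.
    ConstantHolder partition = new ConstantHolder(FieldType.STRUCT, List.of("2023-09-01", 7));
    System.out.println(buildAssumingInteger(partition).type); // prints INTEGER
    System.out.println(buildTyped(partition).type); // prints STRUCT
  }
}
```

In this model the mislabeled vector still holds the struct value, so the mismatch would only surface when a consumer trusts the reported type.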

aokolnychyi · 2023-09-01T22:34:00Z

.../v3.4/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ConstantColumnVector.java

  private final Object constant;
  private final int batchSize;

-  ConstantColumnVector(Type type, int batchSize, Object constant) {


I renamed type to icebergType because the parent class already provides a type variable, but that one is a Spark type.
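The naming clash behind this rename can be sketched as follows. The classes below (EngineVectorSketch, ConstantVectorSketch) are hypothetical and use plain strings in place of real type objects; the only assumption carried over from the comment above is that the parent class already exposes a field named type with a different meaning.

```java
// Hypothetical parent: stands in for an engine-side vector class that owns a 'type' field.
abstract class EngineVectorSketch {
  protected final String type; // engine-side type in this sketch

  EngineVectorSketch(String type) {
    this.type = type;
  }

  String engineType() {
    return type;
  }
}

// Hypothetical child: keeps the table-format type under a distinct name.
final class ConstantVectorSketch extends EngineVectorSketch {
  // Naming this field 'type' would hide the parent's field and make reads ambiguous.
  private final String icebergType;

  ConstantVectorSketch(String icebergType, String engineType) {
    super(engineType);
    this.icebergType = icebergType;
  }

  String icebergType() {
    return icebergType;
  }
}
```

With distinct names, a reader of the subclass can tell at a glance which type system a given reference belongs to.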

aokolnychyi · 2023-09-01T22:34:33Z

spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/source/SparkBatch.java

@@ -114,13 +115,11 @@ public PartitionReaderFactory createReaderFactory() {

  // conditions for using Parquet batch reads:
  // - Parquet vectorization is enabled
-  // - at least one column is projected


I added a test verifying that projecting at least one data column is not a requirement.

aokolnychyi · 2023-09-01T22:35:22Z

spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/source/TestParquetScan.java

+  }
+
+  @Test
+  public void testEmptyTableProjection() throws IOException {


This is the test that uses an empty projection.

jerqi · 2023-09-02T01:58:12Z

spark/v3.4/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestDelete.java

+    Assert.assertEquals("Should have 3 snapshots", 3, Iterables.size(table.snapshots()));
+
+    Snapshot currentSnapshot = SnapshotUtil.latestSnapshot(table, branch);
+    if (mode(table) == COPY_ON_WRITE) {


Should we make the isCopyOnWrite method protected and reuse it here?

The mode method is used in a lot of places and it is a bit more reliable because it checks the table. I used it here so that tests are consistent. We may reconsider that in a follow-up but I don't think it is a big deal.

Thanks, I got it.

aokolnychyi · 2023-09-02T03:21:19Z

arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java

@@ -59,12 +59,17 @@ public VectorHolder(

  // Only used for returning dummy holder
  private VectorHolder() {
+    this(null);


I will need to think more about untyped null holders. I am not sure it is a good idea to have them.

cc @rdblue @RussellSpitzer @nastra @flyrain

While it is arguable, the current solution works because null checks are performed prior to accessing values. Supporting typed null vectors would be a substantial change in our Arrow codebase and I am not convinced it would be worth it. I am keeping it as-is for now; it can be done in the future.

Another option is to make icebergField non-final and protected. Then we don't have to change the constructor; we can set it in here. Either works for me, though, since we have a null check now.

public ConstantVectorHolder(Types.NestedField icebergField, int numRows, T constantValue) {
  this.icebergField = icebergField;
  this.numRows = numRows;
  this.constantValue = constantValue;
}

Exposing fields directly would cause a checkstyle violation. I feel the current approach is simple enough.
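The "null check before access" argument in this thread can be illustrated with a small sketch. NullHolderSketch, Holder, dummy, isDummy, and render are all hypothetical names; the sketch assumes only that an untyped dummy holder is represented by a missing field and that callers test for it before touching the value.

```java
// Hypothetical model of an untyped dummy holder guarded by a null check.
final class NullHolderSketch {
  static final class Holder {
    private final String fieldName; // null marks the untyped dummy holder
    private final Object value;

    Holder(String fieldName, Object value) {
      this.fieldName = fieldName;
      this.value = value;
    }

    static Holder dummy() {
      return new Holder(null, null);
    }

    boolean isDummy() {
      return fieldName == null;
    }

    Object value() {
      if (isDummy()) {
        throw new IllegalStateException("Dummy holder has no value");
      }
      return value;
    }
  }

  // Callers check for the dummy holder first, so its missing type is never consulted.
  static String render(Holder holder) {
    if (holder.isDummy()) {
      return "null";
    }
    return holder.fieldName + "=" + holder.value();
  }

  public static void main(String[] args) {
    System.out.println(render(Holder.dummy())); // prints null
    System.out.println(render(new Holder("_partition", "p1"))); // prints _partition=p1
  }
}
```

The trade-off discussed above is visible here: the guard keeps the untyped holder safe, but every consumer has to remember to apply it.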

aokolnychyi · 2023-09-11T17:13:29Z

arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedReaderBuilder.java

      } else if (id == MetadataColumns.ROW_POSITION.fieldId()) {
        if (setArrowValidityVector) {
          reorderedFields.add(VectorizedArrowReader.positionsWithSetArrowValidityVector());
        } else {
          reorderedFields.add(VectorizedArrowReader.positions());
        }
      } else if (id == MetadataColumns.IS_DELETED.fieldId()) {
-        reorderedFields.add(new VectorizedArrowReader.DeletedVectorReader());
+        reorderedFields.add(new DeletedVectorReader());


I had to import ConstantVectorReader directly to keep the line above on one line, and because we usually prefer direct imports. I changed this line for consistency.

flyrain

LGTM. Left minor comments. Thanks @aokolnychyi for the change. Sorry for the delay.

flyrain · 2023-09-13T18:36:21Z

arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java

@@ -59,12 +59,17 @@ public VectorHolder(

  // Only used for returning dummy holder


Not directly related to your PR. Maybe we should reword it and reformat it like this:

/**
 * A dummy holder constructor.
 */

flyrain · 2023-09-13T18:39:24Z

arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java

+    this(null);
+  }
+
+  // Only used for creating constant holders for fields


How about a JavaDoc format like this?

/**
 * Constructor for constant holders of fields.
 *
 * @param field the nested field for the holder.
 */

aokolnychyi · 2023-09-14T05:17:30Z

Thanks for reviewing, @jerqi @flyrain!

github-actions bot added spark, arrow, core labels Sep 1, 2023

aokolnychyi commented Sep 1, 2023

jerqi reviewed Sep 2, 2023

aokolnychyi force-pushed the fix-vectorized-reads-mor branch from 03dc5be to 09c4283 September 2, 2023 03:14

aokolnychyi commented Sep 2, 2023

jerqi approved these changes Sep 4, 2023

aokolnychyi force-pushed the fix-vectorized-reads-mor branch from 09c4283 to 525baba September 11, 2023 17:08

aokolnychyi commented Sep 11, 2023

flyrain approved these changes Sep 13, 2023

aokolnychyi added 2 commits September 13, 2023 18:47

18afb65 Arrow, Spark 3.4: Support vectorized reads with struct constants
e3b188a Review

aokolnychyi force-pushed the fix-vectorized-reads-mor branch from cf0d278 to e3b188a September 14, 2023 01:50

aokolnychyi merged commit bb32b90 into apache:master Sep 14, 2023 (41 checks passed)

aokolnychyi added this to the Iceberg 1.4.0 milestone Sep 14, 2023

aokolnychyi mentioned this pull request Sep 15, 2023: Arrow: Propagate correct field info while reading metadata columns #8568 (Merged)
