Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arrow: add support for null vectors #10953

Open
wants to merge 48 commits into
base: main
Choose a base branch
from

Conversation

slessard
Copy link

@slessard slessard commented Aug 16, 2024

Add new class NullAccessor to aid reading null columns. A null column is a column that exists in the Iceberg schema, but does not exist in the parquet schema. A column would exist in Iceberg, but not in Parquet when a column is added to the Iceberg schema, but no new rows have been added to the table. See ArrowReadetTest.testReadColumnThatDoesNotExistInParquetSchema for an example.

Closes #10275

sl255051 and others added 23 commits May 7, 2024 18:21
Fix NullPointerException when trying to add the vector's class name to the message for an UnsupportedOperationException
This test more closely follows the reproduction steps described in issue apache#10275
This solution hacks in a VectorHolder instance built specifically for the missing column. Implementing this hack allowed me to explore what would be needed to support vectorized reading of null columns
@github-actions github-actions bot added the arrow label Aug 16, 2024
Copy link
Contributor

@nastra nastra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sl255051 I spent some time understanding the problem and going through the code and fixing it myself (without first looking at this PR). I think you're on the right track here to handle the root cause and we should have a NullAccessor that internally tracks a NullVector.

I would suggest to squash all the commits to a single one and then go through my comments

@sl255051
Copy link
Contributor

sl255051 commented Aug 24, 2024 via email

These unit tests, particularly `testIsDummy1` and `testIsDummy2`, exposed a bug in the code where the `isDummy` method no longer returned the expected value.
@slessard
Copy link
Author

@amogh-jahagirdar I am not planning to make any additional changes. Please review

@slessard
Copy link
Author

slessard commented Oct 7, 2024

@RussellSpitzer I'd appreciate it if you could review these changes in iceberg-arrow

Comment on lines +303 to +305
// Alter the table schema by adding a new, optional column.
// Do not add any data for this new column in the one existing row in the table
// and do not insert any new rows into the table.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be confusing vectorized read paths but I'm curious why this isn't reproducible in Spark vectorized reads? Or is it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can try and repro in a unit test for Spark and see if it's the case. To be clear I don't want to hold up this PR on that though since it does seem like a legitimate problem based on the test being done here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I never tried reproducing the issue in Spark. My guess as to why there was no existing test case is that null vectors isn't a bug so much as it is a subfeature that was never implemented. No sense in creating a test for a subfeature that was knowingly never implemented.

What makes me say null vector support was knowingly never implemented? Look at the removed comment in arrow/src/main/java/org/apache/iceberg/arrow/vectorized/ArrowBatchReader.java. That comment wasn't a "this is how this code works" type of comment. In my opinion that comment is a TODO comment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to explore and see how this looks in Spark's vectorized path

this.numRows = numRows;
this.constantValue = null;
}

public ConstantVectorHolder(Types.NestedField icebergField, int numRows, T constantValue) {
super(icebergField);
super(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a big fan of placing an if-statement in the constructor itself, it feels more like we should just have a different constructor or class?

Copy link
Author

@slessard slessard Oct 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RussellSpitzer In an earlier draft of this PR I did create a NullVectorHolder class but was told to instead adapt ConstantVectorHolder to work for my case, so I did that.

Here's a patch to add a new constructor, but it has at least one issue. Passing null as the constant value in the existing constructor is no longer allowed. This is a semantic breaking change, though not an API breaking change. What this means is that third party code that upgrades to a version of Apache Iceberg containing this patch that was previously using this constructor and passing in a null constant value will still compile fine but will have a runtime failure. That doesn't seem like an appealing solution to me.

I think adding a new class such as NullVectorHolder has its own issue. What would be the use case for a ConstantVectorHolder containing a null value versus a NullVectorHolder containing a null value?

Which solution would you prefer?

  1. ternary operator in the super call
  2. Overloaded constructor with semantics breaking change
  3. New class, NullvectorHolder, with ambiguous use case
Subject: [PATCH] Changes
---
Index: arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java b/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java
--- a/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java	(revision 163ee62ce52bc9611198325997010c7e2b793c71)
+++ b/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java	(date 1728512295607)
@@ -148,16 +148,21 @@
     }
 
     public ConstantVectorHolder(Types.NestedField icebergField, int numRows, T constantValue) {
+      super(icebergField);
+      Preconditions.checkNotNull(constantValue, "Constant value cannot be null");
+      this.numRows = numRows;
+      this.constantValue = constantValue;
+    }
+
+    public ConstantVectorHolder(Types.NestedField icebergField, int numRows) {
       super(
-          (null == constantValue) ? new NullVector(icebergField.name(), numRows) : null,
-          icebergField,
-          new NullabilityHolder(numRows));
-      if (null == constantValue) {
-        nullabilityHolder().setNulls(0, numRows);
-      }
+              new NullVector(icebergField.name(), numRows),
+              icebergField,
+              new NullabilityHolder(numRows));
+      nullabilityHolder().setNulls(0, numRows);
 
       this.numRows = numRows;
-      this.constantValue = constantValue;
+      this.constantValue = null;
     }
 
     @Override

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok having gone through the code I think I may have a strategy here but please let me know what i'm missing in this interpretation.

ConstantVectorHolder is used as a dummy holder prior to this commit. It has no vector , column descriptor, etc ... everything is null. The current approach sometimes makes a ConstantVectorHolder which sometimes does have something inside of it. This is done so that GenericArrowVectorAccessorFactory will have something to work with.

From what I can see in GenericArrowVectorAccessorFactory is that it doesn't work with ConstantVectorHolders at all. It currently can only handle cases in which holder.vector() is a non null type and matches some known vector type. Spark on the other hand, does not actually use this path and when it sees a ConstantVectorHolder it instead creates a o.a.i.s.ConstantColumnVector. This is why I don't think Spark has any issue with this. (cc @amogh-jahagirdar )

Now this makes me think what we really need is to just fix GenericArrowVectorAccessorFactory to handle the case where the "vector" is null and in that circumstance attempt to cast the VectorHolder to a ConstantVectorHolder and create the appropriate vector type or accessor. I think we have a few options to do this.

I think the easiest may be to just create an accessor that looks like the Spark ConstantColumnVector and just use that as our generic accessor.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The another approach is to directly mimic the Spark implementation and fork here

And if the holder is a ConstantVectorHolder set the vector to a new ConstantVector which we define similar to the Spark ConstantColumnVector code again.

@@ -140,12 +141,21 @@ public static class ConstantVectorHolder<T> extends VectorHolder {
private final int numRows;

public ConstantVectorHolder(int numRows) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we keeping this constructor? Is it needed if we can make a NullVectorHolder?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This constructor could be deleted but that would be an API breaking change because this is a public constructor in a public class. Which would you prefer, an extra constructor or a breaking change?

this.numRows = numRows;
this.constantValue = null;
}

public ConstantVectorHolder(Types.NestedField icebergField, int numRows, T constantValue) {
super(icebergField);
super(
(null == constantValue) ? new NullVector(icebergField.name(), numRows) : null,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
(null == constantValue) ? new NullVector(icebergField.name(), numRows) : null,
null == constantValue ? new NullVector(icebergField.name(), numRows) : null,

assertThat(root.getSchema()).isEqualTo(expectedSchema);

// check all vector types
assertThat(root.getVector("a").getClass()).isEqualTo(IntVector.class);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assertThat(root.getVector("a").getClass()).isEqualTo(IntVector.class);
assertThat(root.getVector("a")).isInstanceOf(IntVector.class);

please also update all other places in this test class

Copy link

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Nov 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

NullPointerException when using VectorizedArrowReader to read a null column
5 participants