-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Arrow: add support for null vectors #10953
base: main
Are you sure you want to change the base?
Conversation
Fix NullPointerException when trying to add the vector's class name to the message for an UnsupportedOperationException
This test more closely follows the reproduction steps described in issue apache#10275
…eaderTest.java Co-authored-by: Eduard Tudenhoefner <[email protected]>
…eaderTest.java Co-authored-by: Eduard Tudenhoefner <[email protected]>
This solution hacks in a VectorHolder instance built specifically for the missing column. Implementing this hack allowed me to explore what would be needed to support vectorized reading of null columns
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/GenericArrowVectorAccessorFactory.java
Outdated
Show resolved
Hide resolved
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/GenericArrowVectorAccessorFactory.java
Outdated
Show resolved
Hide resolved
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/GenericArrowVectorAccessorFactory.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sl255051 I spent some time understanding the problem and going through the code and fixing it myself (without first looking at this PR). I think you're on the right track here to handle the root cause and we should have a NullAccessor
that internally tracks a NullVector
.
I would suggest to squash all the commits to a single one and then go through my comments
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java
Outdated
Show resolved
Hide resolved
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java
Outdated
Show resolved
Hide resolved
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java
Outdated
Show resolved
Hide resolved
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedReaderBuilder.java
Show resolved
Hide resolved
arrow/src/test/java/org/apache/iceberg/arrow/vectorized/ArrowReaderTest.java
Outdated
Show resolved
Hide resolved
...src/test/java/org/apache/iceberg/arrow/vectorized/GenericArrowVectorAccessorFactoryTest.java
Outdated
Show resolved
Hide resolved
Thank you, @nastra, for your help. I will be on vacation next week. I will pick this up again when I return on September 3.
|
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java
Outdated
Show resolved
Hide resolved
arrow/src/test/java/org/apache/iceberg/arrow/vectorized/ArrowReaderTest.java
Outdated
Show resolved
Hide resolved
arrow/src/test/java/org/apache/iceberg/arrow/vectorized/ArrowReaderTest.java
Outdated
Show resolved
Hide resolved
arrow/src/test/java/org/apache/iceberg/arrow/vectorized/ArrowReaderTest.java
Outdated
Show resolved
Hide resolved
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java
Outdated
Show resolved
Hide resolved
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java
Outdated
Show resolved
Hide resolved
These unit tests, particularly `testIsDummy1` and `testIsDummy2`, exposed a bug in the code where the `isDummy` method no longer returned the expected value.
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java
Outdated
Show resolved
Hide resolved
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java
Outdated
Show resolved
Hide resolved
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java
Outdated
Show resolved
Hide resolved
arrow/src/test/java/org/apache/iceberg/arrow/vectorized/VectorHolderTest.java
Outdated
Show resolved
Hide resolved
arrow/src/test/java/org/apache/iceberg/arrow/vectorized/VectorHolderTest.java
Outdated
Show resolved
Hide resolved
arrow/src/test/java/org/apache/iceberg/arrow/vectorized/VectorHolderTest.java
Outdated
Show resolved
Hide resolved
@amogh-jahagirdar I am not planning to make any additional changes. Please review |
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java
Outdated
Show resolved
Hide resolved
@RussellSpitzer I'd appreciate it if you could review these changes in iceberg-arrow |
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java
Show resolved
Hide resolved
arrow/src/test/java/org/apache/iceberg/arrow/vectorized/ArrowReaderTest.java
Outdated
Show resolved
Hide resolved
// Alter the table schema by adding a new, optional column. | ||
// Do not add any data for this new column in the one existing row in the table | ||
// and do not insert any new rows into the table. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might be confusing vectorized read paths but I'm curious why this isn't reproducible in Spark vectorized reads? Or is it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can try and repro in a unit test for Spark and see if it's the case. To be clear I don't want to hold up this PR on that though since it does seem like a legitimate problem based on the test being done here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I never tried reproducing the issue in Spark. My guess as to why there was no existing test case is that null vectors isn't a bug so much as it is a subfeature that was never implemented. No sense in creating a test for a subfeature that was knowingly never implemented.
What makes me say null vector support was knowingly never implemented? Look at the removed comment in arrow/src/main/java/org/apache/iceberg/arrow/vectorized/ArrowBatchReader.java. That comment wasn't a "this is how this code works" type of comment. In my opinion that comment is a TODO comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to explore and see how this looks in Spark's vectorized path
arrow/src/test/java/org/apache/iceberg/arrow/vectorized/ArrowReaderTest.java
Outdated
Show resolved
Hide resolved
arrow/src/test/java/org/apache/iceberg/arrow/vectorized/VectorHolderTest.java
Show resolved
Hide resolved
arrow/src/test/java/org/apache/iceberg/arrow/vectorized/VectorHolderTest.java
Show resolved
Hide resolved
this.numRows = numRows; | ||
this.constantValue = null; | ||
} | ||
|
||
public ConstantVectorHolder(Types.NestedField icebergField, int numRows, T constantValue) { | ||
super(icebergField); | ||
super( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a big fan of placing an if-statement in the constructor itself, it feels more like we should just have a different constructor or class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@RussellSpitzer In an earlier draft of this PR I did create a NullVectorHolder class but was told to instead adapt ConstantVectorHolder to work for my case, so I did that.
Here's a patch to add a new constructor, but it has at least one issue. Passing null
as the constant value in the existing constructor is no longer allowed. This is a semantic breaking change, though not an API breaking change. What this means is that third party code that upgrades to a version of Apache Iceberg containing this patch that was previously using this constructor and passing in a null constant value will still compile fine but will have a runtime failure. That doesn't seem like an appealing solution to me.
I think adding a new class such as NullVectorHolder has its own issue. What would be the use case for a ConstantVectorHolder containing a null value versus a NullVectorHolder containing a null value?
Which solution would you prefer?
- ternary operator in the super call
- Overloaded constructor with semantics breaking change
- New class, NullvectorHolder, with ambiguous use case
Subject: [PATCH] Changes
---
Index: arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java b/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java
--- a/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java (revision 163ee62ce52bc9611198325997010c7e2b793c71)
+++ b/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java (date 1728512295607)
@@ -148,16 +148,21 @@
}
public ConstantVectorHolder(Types.NestedField icebergField, int numRows, T constantValue) {
+ super(icebergField);
+ Preconditions.checkNotNull(constantValue, "Constant value cannot be null");
+ this.numRows = numRows;
+ this.constantValue = constantValue;
+ }
+
+ public ConstantVectorHolder(Types.NestedField icebergField, int numRows) {
super(
- (null == constantValue) ? new NullVector(icebergField.name(), numRows) : null,
- icebergField,
- new NullabilityHolder(numRows));
- if (null == constantValue) {
- nullabilityHolder().setNulls(0, numRows);
- }
+ new NullVector(icebergField.name(), numRows),
+ icebergField,
+ new NullabilityHolder(numRows));
+ nullabilityHolder().setNulls(0, numRows);
this.numRows = numRows;
- this.constantValue = constantValue;
+ this.constantValue = null;
}
@Override
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok having gone through the code I think I may have a strategy here but please let me know what i'm missing in this interpretation.
ConstantVectorHolder is used as a dummy holder prior to this commit. It has no vector , column descriptor, etc ... everything is null. The current approach sometimes makes a ConstantVectorHolder which sometimes does have something inside of it. This is done so that GenericArrowVectorAccessorFactory will have something to work with.
From what I can see in GenericArrowVectorAccessorFactory is that it doesn't work with ConstantVectorHolders at all. It currently can only handle cases in which holder.vector() is a non null type and matches some known vector type. Spark on the other hand, does not actually use this path and when it sees a ConstantVectorHolder it instead creates a o.a.i.s.ConstantColumnVector. This is why I don't think Spark has any issue with this. (cc @amogh-jahagirdar )
Now this makes me think what we really need is to just fix GenericArrowVectorAccessorFactory to handle the case where the "vector" is null and in that circumstance attempt to cast the VectorHolder to a ConstantVectorHolder and create the appropriate vector type or accessor. I think we have a few options to do this.
I think the easiest may be to just create an accessor that looks like the Spark ConstantColumnVector and just use that as our generic accessor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The another approach is to directly mimic the Spark implementation and fork here
iceberg/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/ArrowBatchReader.java
Line 53 in c07f2aa
// Handle null vector for constant case |
And if the holder is a ConstantVectorHolder set the vector to a new ConstantVector which we define similar to the Spark ConstantColumnVector code again.
@@ -140,12 +141,21 @@ public static class ConstantVectorHolder<T> extends VectorHolder { | |||
private final int numRows; | |||
|
|||
public ConstantVectorHolder(int numRows) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we keeping this constructor? Is it needed if we can make a NullVectorHolder?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This constructor could be deleted but that would be an API breaking change because this is a public constructor in a public class. Which would you prefer, an extra constructor or a breaking change?
this.numRows = numRows; | ||
this.constantValue = null; | ||
} | ||
|
||
public ConstantVectorHolder(Types.NestedField icebergField, int numRows, T constantValue) { | ||
super(icebergField); | ||
super( | ||
(null == constantValue) ? new NullVector(icebergField.name(), numRows) : null, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(null == constantValue) ? new NullVector(icebergField.name(), numRows) : null, | |
null == constantValue ? new NullVector(icebergField.name(), numRows) : null, |
assertThat(root.getSchema()).isEqualTo(expectedSchema); | ||
|
||
// check all vector types | ||
assertThat(root.getVector("a").getClass()).isEqualTo(IntVector.class); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assertThat(root.getVector("a").getClass()).isEqualTo(IntVector.class); | |
assertThat(root.getVector("a")).isInstanceOf(IntVector.class); |
please also update all other places in this test class
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions. |
Add new class NullAccessor to aid reading null columns. A null column is a column that exists in the Iceberg schema, but does not exist in the parquet schema. A column would exist in Iceberg, but not in Parquet when a column is added to the Iceberg schema, but no new rows have been added to the table. See ArrowReadetTest.testReadColumnThatDoesNotExistInParquetSchema for an example.
Closes #10275