GH-40038: [Java] Export non empty offset buffer for variable-size layout through C Data Interface #40043
…e-size layout should not be empty
if (conflictPolicy == ConflictPolicy.CONFLICT_REPLACE && vectors.containsKey(childName)) {
  vectors.getAll(childName).forEach(c -> c.close());
}
This addresses a test failure in TestStructVector. When duplicate field names exist in an initializeChildrenFromFields call, the initial offset buffers of the previously added vectors would leak memory, so those duplicate vectors are closed here.
Shouldn't this be done in putVector instead?
if (!init) {
  offsetAllocationSizeInBytes = curSize;
}
This avoids resetting offsetAllocationSizeInBytes to the initial 4-byte size. Otherwise, reallocOffsetBuffer would pick up that value and cause some tests to fail.
@@ -50,6 +50,7 @@ public void testZeroRowResultSet() throws Exception {
assertNotNull("VectorSchemaRoot from first next() result should never be null", root);
assertEquals("VectorSchemaRoot from empty ResultSet should have zero rows", 0, root.getRowCount());
assertFalse("hasNext() should return false on empty ResultSets after initial next() call", iter.hasNext());
root.close();
This is needed to release the initial offset buffer allocated for the empty array.
Perhaps we can use a try instead of calling close explicitly?
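As a rough illustration of that suggestion, the sketch below shows how a try-with-resources block closes a resource even when the body throws. TrackedResource is a hypothetical stand-in for an AutoCloseable such as VectorSchemaRoot; it is not an Arrow class, and the PR's actual test code may look different.

```java
// Minimal sketch, not Arrow code: TrackedResource is a hypothetical
// stand-in for an AutoCloseable such as VectorSchemaRoot.
final class TryWithResourcesSketch {
    static final class TrackedResource implements AutoCloseable {
        private boolean closed = false;

        int getRowCount() {
            return 0;
        }

        boolean isClosed() {
            return closed;
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    // Uses the resource inside try-with-resources and returns it afterwards,
    // so the caller can check that close() ran on both paths.
    static TrackedResource useAndClose(boolean failInside) {
        TrackedResource resource = new TrackedResource();
        try (TrackedResource root = resource) {
            if (failInside) {
                throw new IllegalStateException("simulated assertion failure");
            }
            root.getRowCount();
        } catch (IllegalStateException e) {
            // By the time this handler runs, close() has already been called.
        }
        return resource;
    }

    public static void main(String[] args) {
        System.out.println(useAndClose(false).isClosed()); // prints "true"
        System.out.println(useAndClose(true).isClosed());  // prints "true"
    }
}
```

Compared with a trailing close() call, this form also releases the buffer when one of the assertions before it fails.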
I've run the vector tests locally and they pass. Tests in other modules might still fail. If any do, I will look at them later.
Java is correct here.
https://lists.apache.org/thread/w7g1zfqrjxx0bvrct0mt5zwxvdnc9nob
Ah, okay, I see. I had not found the ticket. It looks like both cases (an empty offset buffer or one holding a single zero element) are acceptable. I will close this. Thanks for the info.
Are you using arrow2? It's possible it was never updated to fix this, but arrow-rs appears to have been fixed.
I'm using arrow-rs. Although its array data allows empty offsets, the issue happens in the ffi module, which doesn't allow empty offsets for now. I proposed a fix there.
Hmm. While we allow that for IPC, I think I'm wrong about C Data: we explicitly specify that this is not allowed: https://arrow.apache.org/docs/format/CDataInterface.html#c.ArrowArray.buffers @pitrou presumably Java should actually be fixed here, but what do other implementations do? We also appear to be missing tests for this case.
Isn't it allowed in the second situation?
Ah, fair. It reads as a bit ambiguous to me: would the size of the buffer be 0, or is it that it should not be 0, but we allowed 0 as an exception before? I guess both interpretations lead to the buffer being 0-sized, though, and so it applies. Maybe we should spell out the case... Regardless, it does seem we're missing an integration test then.
Yes, I think it should be fixed. Zero-length columns are tested in the integration test, but whether they actually miss an offsets buffer depends on how the JSON reader behaves.
Sorry, this is for the C Data integration tests.
Ah ok, now I see what you mean.
It seems this fix might be a bit heavy-handed if it adds a heap allocation for every empty var-width vector? AFAIU, Arrow Java uses empty vectors quite liberally.
I think it should mean that, in the second situation, the buffer size is 0 and the buffer pointers may be null.
Yes, but it's the buffer size, not the array size. For an empty array, the buffer size should be 4 (or 8 for a large offsets type), so the buffer can't be null.
Yeah, and I believe @lidavidm meant that these empty offsets for an empty var-width vector are valid. That's why I closed this and proposed a fix in the Rust ffi module instead.
That's what I thought at the beginning, and it is the reason I proposed this fix. But from the context @lidavidm provided, I think he means that an empty offset buffer is a valid case.
I guess this might be something we really need to fix (at least in the C Data export path for var-size arrays). Reopened it.
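The size argument made above (4 bytes for an empty array, 8 for a large offsets type) follows from the "length + 1" rule in the columnar spec. A minimal sketch of that arithmetic, using a hypothetical helper name that does not exist in Arrow Java:

```java
// Minimal sketch of the "length + 1 offsets" rule; offsetBufferBytes is a
// hypothetical helper, not an Arrow Java API.
final class OffsetBufferSize {
    // offsetWidth is 4 for 32-bit offsets and 8 for large (64-bit) offsets.
    static long offsetBufferBytes(long length, int offsetWidth) {
        if (length < 0) {
            throw new IllegalArgumentException("length must be non-negative");
        }
        return (length + 1) * offsetWidth;
    }

    public static void main(String[] args) {
        System.out.println(offsetBufferBytes(0, 4)); // prints "4"
        System.out.println(offsetBufferBytes(0, 8)); // prints "8"
        System.out.println(offsetBufferBytes(3, 4)); // prints "16"
    }
}
```

Under this rule the offsets buffer is never zero-sized, which is why a null pointer at buffer position 1 is rejected by a strict consumer.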
cc @sunchao for another pair of eyes on this. Thanks.
Thanks for the review @lidavidm
Thank you @lidavidm @pitrou @vibhatha @andygrove for the review
After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit 5ddef63. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 11 possible false positives for unstable benchmarks that are known to sometimes produce them.
…ze layout through C Data Interface (apache#40043)

### Rationale for this change

We encountered an error when exchanging string array from Java to Rust through Arrow C data interface. At Rust side, it complains that the buffer at position 1 (offset buffer) is null. After tracing down and some debugging, it looks like the issue is Java Arrow `BaseVariableWidthVector` class assigns an empty offset buffer if the array is empty (value count 0).

According to Arrow [spec](https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-layout) for variable size binary layout:

> The offsets buffer contains length + 1 signed integers ...

So for an empty string array, its offset buffer should be a buffer with one element (generally it is `0`).

### What changes are included in this PR?

This patch replaces current empty offset buffer in variable-size layout vector classes when exporting arrays through C Data Interface.

### Are these changes tested?

Added test cases.

### Are there any user-facing changes?

No

* Closes: apache#40038

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: David Li <[email protected]>
Rationale for this change
We encountered an error when exchanging a string array from Java to Rust through the Arrow C Data Interface. On the Rust side, it complains that the buffer at position 1 (the offset buffer) is null. After some tracing and debugging, it looks like the issue is that the Java Arrow BaseVariableWidthVector class assigns an empty offset buffer if the array is empty (value count 0).
According to the Arrow spec (https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-layout) for the variable-size binary layout:
> The offsets buffer contains length + 1 signed integers ...
So for an empty string array, the offset buffer should be a buffer with one element (generally 0).
What changes are included in this PR?
This patch replaces the current empty offset buffer in the variable-size layout vector classes when exporting arrays through the C Data Interface.
Are these changes tested?
Added test cases.
Are there any user-facing changes?
No
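
To make the rationale concrete, here is a minimal sketch using plain java.nio buffers rather than Arrow's own buffer classes. The class and method names are hypothetical and this is not the PR's implementation; it only illustrates how an offsets buffer is laid out for a string array, and how an export step might substitute a single zero offset when the existing buffer is empty.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Minimal sketch with plain java.nio buffers; OffsetsSketch and its methods
// are hypothetical and are not part of Arrow Java.
final class OffsetsSketch {
    // Builds a 32-bit offsets buffer with values.length + 1 entries.
    static ByteBuffer buildOffsets(String[] values) {
        ByteBuffer buf = ByteBuffer.allocate((values.length + 1) * Integer.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN);
        int position = 0;
        buf.putInt(position); // the first offset is always 0
        for (String value : values) {
            position += value.getBytes(StandardCharsets.UTF_8).length;
            buf.putInt(position);
        }
        buf.flip();
        return buf;
    }

    // Assumed export-time guard: an empty offsets buffer is replaced by a
    // buffer holding one zero offset; any other buffer is passed through.
    static ByteBuffer offsetsForExport(ByteBuffer offsets) {
        if (offsets.remaining() == 0) {
            ByteBuffer single = ByteBuffer.allocate(Integer.BYTES)
                    .order(ByteOrder.LITTLE_ENDIAN);
            single.putInt(0);
            single.flip();
            return single;
        }
        return offsets;
    }

    public static void main(String[] args) {
        ByteBuffer offsets = buildOffsets(new String[] {"ab", "", "cde"});
        System.out.println(offsets.remaining()); // prints "16"
        System.out.println(offsets.getInt(12));  // prints "5"
        System.out.println(offsetsForExport(ByteBuffer.allocate(0)).remaining()); // prints "4"
    }
}
```

Doing the substitution only at export time, as the description says the patch does, would leave the in-memory representation of empty vectors unchanged.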