-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Java] Excessive resizing in VectorAppender leads to avoidable OversizedAllocationException #37829
Comments
Is the proposed change to make |
hrishisd
pushed a commit
to hrishisd/arrow
that referenced
this issue
Sep 24, 2023
…g variable length vectors
hrishisd
pushed a commit
to hrishisd/arrow
that referenced
this issue
Sep 24, 2023
…g variable length vectors
lidavidm
pushed a commit
that referenced
this issue
Sep 25, 2023
…able length vectors (#37844) ### Rationale for this change This change prevents avoidable `OversizedAllocationException`s when appending a variable-length vector with many small elements to a variable-length vector with a few large elements. When appending variable-length vectors, `VectorAppender` iteratively doubles the offset and validity buffers until they can accommodate the combined elements. In the previous implementation, each iteration would also double the data buffer's capacity. This behavior is appropriate for vectors of fixed-size types but can result in an oversized data buffers when appending many small elements to a variable length vector with a large data buffer. ### What changes are included in this PR? The new behavior only resizes the offset and validity buffers when resizing the target vector's buffers to ensure they can hold the total number of combined elements. The data buffer is resized based on the total required data size of the combined elements. ### Are these changes tested? Yes. I added a unit test that results in an `OversizedAllocationException` when run against the previous version of the code. ### Are there any user-facing changes? No. * Closes: #37829 Authored-by: hrishisd <[email protected]> Signed-off-by: David Li <[email protected]>
etseidl
pushed a commit
to etseidl/arrow
that referenced
this issue
Sep 28, 2023
…g variable length vectors (apache#37844) ### Rationale for this change This change prevents avoidable `OversizedAllocationException`s when appending a variable-length vector with many small elements to a variable-length vector with a few large elements. When appending variable-length vectors, `VectorAppender` iteratively doubles the offset and validity buffers until they can accommodate the combined elements. In the previous implementation, each iteration would also double the data buffer's capacity. This behavior is appropriate for vectors of fixed-size types but can result in an oversized data buffers when appending many small elements to a variable length vector with a large data buffer. ### What changes are included in this PR? The new behavior only resizes the offset and validity buffers when resizing the target vector's buffers to ensure they can hold the total number of combined elements. The data buffer is resized based on the total required data size of the combined elements. ### Are these changes tested? Yes. I added a unit test that results in an `OversizedAllocationException` when run against the previous version of the code. ### Are there any user-facing changes? No. * Closes: apache#37829 Authored-by: hrishisd <[email protected]> Signed-off-by: David Li <[email protected]>
JerAguilon
pushed a commit
to JerAguilon/arrow
that referenced
this issue
Oct 23, 2023
…g variable length vectors (apache#37844) ### Rationale for this change This change prevents avoidable `OversizedAllocationException`s when appending a variable-length vector with many small elements to a variable-length vector with a few large elements. When appending variable-length vectors, `VectorAppender` iteratively doubles the offset and validity buffers until they can accommodate the combined elements. In the previous implementation, each iteration would also double the data buffer's capacity. This behavior is appropriate for vectors of fixed-size types but can result in an oversized data buffers when appending many small elements to a variable length vector with a large data buffer. ### What changes are included in this PR? The new behavior only resizes the offset and validity buffers when resizing the target vector's buffers to ensure they can hold the total number of combined elements. The data buffer is resized based on the total required data size of the combined elements. ### Are these changes tested? Yes. I added a unit test that results in an `OversizedAllocationException` when run against the previous version of the code. ### Are there any user-facing changes? No. * Closes: apache#37829 Authored-by: hrishisd <[email protected]> Signed-off-by: David Li <[email protected]>
dongjoon-hyun
pushed a commit
to apache/spark
that referenced
this issue
Nov 4, 2023
### What changes were proposed in this pull request? This pr upgrade Apache Arrow from 13.0.0 to 14.0.0. ### Why are the changes needed? The Apache Arrow 14.0.0 release brings a number of enhancements and bug fixes. In terms of bug fixes, the release addresses several critical issues that were causing failures in integration jobs with Spark([GH-36332](apache/arrow#36332)) and problems with importing empty data arrays([GH-37056](apache/arrow#37056)). It also optimizes the process of appending variable length vectors([GH-37829](apache/arrow#37829)) and includes C++ libraries for MacOS AARCH 64 in Java-Jars([GH-38076](apache/arrow#38076)). The new features and improvements focus on enhancing the handling and manipulation of data. This includes the introduction of DefaultVectorComparators for large types([GH-25659](apache/arrow#25659)), support for extended expressions in ScannerBuilder([GH-34252](apache/arrow#34252)), and the exposure of the VectorAppender class([GH-37246](apache/arrow#37246)). The release also brings enhancements to the development and testing process, with the CI environment now using JDK 21([GH-36994](apache/arrow#36994)). In addition, the release introduces vector validation consistent with C++, ensuring consistency across different languages([GH-37702](apache/arrow#37702)). Furthermore, the usability of VarChar writers and binary writers has been improved with the addition of extra input methods([GH-37705](apache/arrow#37705)), and VarCharWriter now supports writing from `Text` and `String`([GH-37706](apache/arrow#37706)). The release also adds typed getters for StructVector, improving the ease of accessing data([GH-37863](apache/arrow#37863)). The full release notes as follows: - https://arrow.apache.org/release/14.0.0.html ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes #43650 from LuciferYang/arrow-14. Lead-authored-by: yangjie01 <[email protected]> Co-authored-by: YangJie <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
loicalleyne
pushed a commit
to loicalleyne/arrow
that referenced
this issue
Nov 13, 2023
…g variable length vectors (apache#37844) ### Rationale for this change This change prevents avoidable `OversizedAllocationException`s when appending a variable-length vector with many small elements to a variable-length vector with a few large elements. When appending variable-length vectors, `VectorAppender` iteratively doubles the offset and validity buffers until they can accommodate the combined elements. In the previous implementation, each iteration would also double the data buffer's capacity. This behavior is appropriate for vectors of fixed-size types but can result in an oversized data buffers when appending many small elements to a variable length vector with a large data buffer. ### What changes are included in this PR? The new behavior only resizes the offset and validity buffers when resizing the target vector's buffers to ensure they can hold the total number of combined elements. The data buffer is resized based on the total required data size of the combined elements. ### Are these changes tested? Yes. I added a unit test that results in an `OversizedAllocationException` when run against the previous version of the code. ### Are there any user-facing changes? No. * Closes: apache#37829 Authored-by: hrishisd <[email protected]> Signed-off-by: David Li <[email protected]>
dgreiss
pushed a commit
to dgreiss/arrow
that referenced
this issue
Feb 19, 2024
…g variable length vectors (apache#37844) ### Rationale for this change This change prevents avoidable `OversizedAllocationException`s when appending a variable-length vector with many small elements to a variable-length vector with a few large elements. When appending variable-length vectors, `VectorAppender` iteratively doubles the offset and validity buffers until they can accommodate the combined elements. In the previous implementation, each iteration would also double the data buffer's capacity. This behavior is appropriate for vectors of fixed-size types but can result in an oversized data buffers when appending many small elements to a variable length vector with a large data buffer. ### What changes are included in this PR? The new behavior only resizes the offset and validity buffers when resizing the target vector's buffers to ensure they can hold the total number of combined elements. The data buffer is resized based on the total required data size of the combined elements. ### Are these changes tested? Yes. I added a unit test that results in an `OversizedAllocationException` when run against the previous version of the code. ### Are there any user-facing changes? No. * Closes: apache#37829 Authored-by: hrishisd <[email protected]> Signed-off-by: David Li <[email protected]>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Describe the bug, including details regarding any error messages, version, and platform.
Arrow version
Built from main branch as of 9/21/2023
Problem description
When appending two variable length vectors,
VectorAppender
repeatedly resizes the validity and offset buffers of the target vector until they can hold the combined elements. While doing so, it also resizes the data buffer which can cause the data buffer to exceed the max allocation limit when we append a large number of small elements to a vector with a single large element.Steps to reproduce
The error can be reproduced using the snippet below.
The example produces the following error
It looks like the issue is also present when appending other variable-length vector types.
I'm happy to post a PR if this looks reasonable.
Component(s)
Java
The text was updated successfully, but these errors were encountered: