GH-43266: [C#] Add LargeBinary, LargeString and LargeList array types #43269
public ReadOnlySpan<long> ValueOffsets => ValueOffsetsBuffer.Span.CastTo<long>().Slice(Offset, Length + 1);

public ReadOnlySpan<byte> Values => ValueBuffer.Span.CastTo<byte>();
I'm not sure if we want to consider naming this something different like SmallValues to help with backwards compatibility when we eventually have something like a LargeReadOnlySpan<T> type.
This problem isn't specific to these new array types though. For PrimitiveArray for example we'll probably also want to introduce a new "Large" version of the Values ReadOnlySpan, so for consistency I think it's fine to keep calling this Values. Then we can later add something named like LargeValues to all applicable array types.
I have to admit that I like the consistency of having the same members on both classes, but I also wonder at the value (ha ha) of exposing this at all. Someone who needs to get at the underlying buffer can already access ValueBuffer, and this span doesn't have any clear uses.
Good point, not adding this member solves the backwards compatibility problem nicely, and I don't see a great need for it either. I'll remove this.
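For readers following the thread, here is a minimal sketch of what reading one element through 64-bit offsets looks like when only the raw buffers are exposed. It is not the code from this PR: the class and method names are hypothetical, it uses the standard MemoryMarshal.Cast rather than Arrow's internal CastTo helper, and it assumes a little-endian host.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical helper, not part of Apache.Arrow.
public static class LargeBinarySketch
{
    // Reinterpret the raw offsets buffer as 64-bit offsets and take the
    // window for a sliced array: `length` values need `length + 1` offsets.
    public static ReadOnlySpan<long> ValueOffsets(
        ReadOnlySpan<byte> offsetsBuffer, int offset, int length)
        => MemoryMarshal.Cast<byte, long>(offsetsBuffer).Slice(offset, length + 1);

    // Return the bytes of element `index`. Span slicing takes int arguments,
    // so the checked casts throw OverflowException for offsets past int.MaxValue.
    public static ReadOnlySpan<byte> GetBytes(
        ReadOnlySpan<byte> offsetsBuffer, ReadOnlySpan<byte> valueBuffer,
        int offset, int length, int index)
    {
        ReadOnlySpan<long> offsets = ValueOffsets(offsetsBuffer, offset, length);
        int start = checked((int)offsets[index]);
        int end = checked((int)offsets[index + 1]);
        return valueBuffer.Slice(start, end - start);
    }
}
```

The checked casts are where the 2 GiB limit surfaces: the offsets are stored as int64, but a span can only be sliced with int positions.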
@@ -132,7 +132,13 @@ protected ReadResult ReadMessage()

Flatbuf.Message message = Flatbuf.Message.GetRootAsMessage(CreateByteBuffer(messageBuff));

int bodyLength = checked((int)message.BodyLength);
if (message.BodyLength > int.MaxValue)
I don't think there's a need for more specific checks when importing individual arrays as this should cover all scenarios.
There's also an ArrowMemoryReaderImplementation but that needs to be constructed with a ReadOnlyMemory<byte> so is already limited to 2 GiB.
I tested reading data from Flight which uses a RecordBatchReaderImplementation, but that actually fails on the C++ server side with an error that > 2 GiB record batches are not supported (see FlightPayload::Validate), and I'm not sure whether any other languages do support > 2 GiB batches over Flight.
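As an illustration of the kind of up-front check discussed here, the sketch below converts a 64-bit body length to int and fails with a descriptive message instead of a bare OverflowException from a checked cast. The class name, method name, exception type, message text, and the extra negative-length check are all assumptions made for this example; the diff above shows only the condition itself.

```csharp
using System;
using System.IO;

// Hypothetical helper, not the code from this PR.
public static class BodyLengthSketch
{
    // Convert a 64-bit IPC message body length to int, rejecting values
    // that cannot be held in a single .NET buffer.
    public static int ToInt32BodyLength(long bodyLength)
    {
        if (bodyLength < 0 || bodyLength > int.MaxValue)
        {
            throw new InvalidDataException(
                $"Message body length {bodyLength} is outside the supported range [0, {int.MaxValue}] bytes.");
        }
        return (int)bodyLength;
    }
}
```

Doing the check once per message means the individual array readers do not need their own size checks, which matches the reasoning in the comment above.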
Hey @adamreeve! This is great, but isn't the main point of using Large List and Large Binary / Large String to be able to have offsets represented with int64 and/or data buffers > 2 GiB?
This will allow some compatibility with other languages, but in the cases supported we should probably not use Large Binary/String/List and just use "normal" Binary/String/List.
I don't quite see the rationale for the change without supporting the main "feature" of these Large formats, but I am probably just missing some context :)
@raulcd The primary motivation is integration with Polars, which apparently doesn't support the non-Large versions.
Yeah as Curt says this is about compatibility with other libraries, but otherwise these don't currently provide any other benefit if you're only using .NET, and you'll get errors if you try to import anything that's too large. I'm happy to revert the documentation changes if you think they're misleading? Or maybe update the wording to add that this is only to help with interoperability?
Thanks, that makes sense with the context. I see the value of supporting it now. I think we could try to improve the wording as suggested, as it might be misleading for users who could run into interoperability issues when using large formats from other systems/languages if the data doesn't fit the current limitations.
OK thanks, I've updated the status documentation so it isn't so misleading now.
Thanks!
After merging your PR, Conbench analyzed the 0 benchmarking runs that have been run so far on merge-commit 299ad70. None of the specified runs were found on the Conbench server. The full Conbench report has more details.
Rationale for this change
See #43266. Note that LargeBinary and LargeString are still limited to 2 GiB buffers, and LargeList is limited to offsets that can be represented as int32.
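To make the LargeList limitation concrete, here is a small sketch of a check for whether a set of int64 offsets stays within the int32 range. The class and method names are hypothetical and this is not necessarily how the PR validates offsets. It assumes the offsets are non-decreasing, as Arrow list offsets are, so only the last one needs inspecting.

```csharp
using System;

// Hypothetical helper, not part of Apache.Arrow.
public static class LargeListOffsetsSketch
{
    // Offsets are assumed non-decreasing, so the last offset is the largest.
    // An empty span trivially fits.
    public static bool OffsetsFitInInt32(ReadOnlySpan<long> offsets)
        => offsets.IsEmpty || offsets[offsets.Length - 1] <= int.MaxValue;
}
```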
What changes are included in this PR?
Are these changes tested?
Yes, I've added some basic tests specifically for the new array types, and added these to the test data generator so they're covered by the existing tests for round tripping using IPC and C Data Interface.
Are there any user-facing changes?
Yes, this is a new user facing feature.
Implementation notes
BinaryArrayBase<TOffset> class for example, but I think this would require generic math support to work nicely, and would still complicate the code quite a bit and add extra virtual method call overhead. So I think it's fine to keep these new Array subtypes independent from the non-large types.