GH-43266: [C#] Add LargeBinary, LargeString and LargeList array types #43269

adamreeve · 2024-07-16T04:52:10Z

Rationale for this change

See #43266. Note that LargeBinary and LargeString are still limited to 2 GiB buffers, and LargeList is limited to offsets that can be represented as int32.

What changes are included in this PR?

Add new Array subtypes: LargeBinaryArray, LargeStringArray and LargeListArray
Support round-tripping these array types via the IPC format
Support round-tripping these array types via the C Data Interface
Improve error messages when importing arrays that are too large via IPC or C Data Interface
Enable integration tests for the new types
Update documentation

Are these changes tested?

Yes, I've added some basic tests specifically for the new array types, and added these to the test data generator so they're covered by the existing tests for round tripping using IPC and C Data Interface.

Are there any user-facing changes?

Yes, this is a new user facing feature.

Implementation notes

I haven't added builders for these new array types. Given they're added to help with interoperability with other libraries, I wouldn't expect .NET users to build arrays of these types as they provide no other benefit over the non-large types until we have proper large memory support. But I'm happy to add this if it would be useful.
The new array types share a lot of logic with the non-large types. I considered trying to consolidate this logic by adding a new BinaryArrayBase<TOffset> class for example, but I think this would require generic math support to work nicely, and would still complicate the code quite a bit and add extra virtual method call overhead. So I think it's fine to keep these new Array subtypes independent from the non-large types.
I haven't included support for materializing a LargeStringArray (see GH-41047: [C#] Address performance issue of reading from StringArray #41048). I'm not sure whether there would be a use for this, but it could be added later if needed.
GitHub Issue: [C#] Add LargeBinary, LargeString and LargeList array types #43266

…rrays

adamreeve · 2024-07-16T05:05:40Z

csharp/src/Apache.Arrow/Arrays/LargeBinaryArray.cs

+
+    public ReadOnlySpan<long> ValueOffsets => ValueOffsetsBuffer.Span.CastTo<long>().Slice(Offset, Length + 1);
+
+    public ReadOnlySpan<byte> Values => ValueBuffer.Span.CastTo<byte>();


I'm not sure if we want to consider naming this something different like SmallValues to help with backwards compatibility when we eventually have something like a LargeReadOnlySpan<T> type.

This problem isn't specific to these new array types though. For PrimitiveArray for example we'll probably also want to introduce a new "Large" version of the Values ReadOnlySpan, so for consistency I think it's fine to keep calling this Values. Then we can later add something named like LargeValues to all applicable array types.

I have to admit that I like the consistency of having the same members on both classes, but I also wonder at the value (ha ha) of exposing this at all. Someone who needs to get at the underlying buffer can already access ValueBuffer, and this span doesn't have any clear uses.

Good point, not adding this member solves the backwards compatibility problem nicely, and I don't see a great need for it either. I'll remove this.

adamreeve · 2024-07-16T05:18:07Z

csharp/src/Apache.Arrow/Ipc/ArrowStreamReaderImplementation.cs

@@ -132,7 +132,13 @@ protected ReadResult ReadMessage()

                Flatbuf.Message message = Flatbuf.Message.GetRootAsMessage(CreateByteBuffer(messageBuff));

-                int bodyLength = checked((int)message.BodyLength);
+                if (message.BodyLength > int.MaxValue)


I don't think there's a need for more specific checks when importing individual arrays as this should cover all scenarios.

There's also an ArrowMemoryReaderImplementation but that needs to be constructed with a ReadOnlyMemory<byte> so is already limited to 2 GiB.

I tested reading data from Flight which uses a RecordBatchReaderImplementation, but that actually fails on the C++ server side with an error that > 2 GiB record batches are not supported (see FlightPayload::Validate), and I'm not sure whether any other languages do support > 2 GiB batches over Flight.

raulcd

hey @adamreeve ! this is great but Isn't the main point of using Large List and Large Binary / Large String to be able to have offsets represented with int64 and/or data buffers > 2GiB?
This will allow some compatibility with other languages but on the cases supported we should probably not use Large Binary/String/List and just use "normal" Binary/String/List.

I am missing to see the rationale of the change without supporting the main "feature" of these Large formats but probably I am just missing some context :)

CurtHagenlocher · 2024-07-16T11:47:18Z

@raulcd The primary motivation is integration with Polars, which apparently doesn't support the non-Large versions.

adamreeve · 2024-07-16T21:15:56Z

Yeah as Curt says this is about compatibility with other libraries, but otherwise these don't currently provide any other benefit if you're only using .NET, and you'll get errors if you try to import anything that's too large.

I'm happy to revert the documentation changes if you think they're misleading? Or maybe update the wording to add that this is only to help with interoperability?

raulcd · 2024-07-17T08:56:48Z

Or maybe update the wording to add that this is only to help with interoperability?

Thanks, that makes sense with the context. I see the value of supporting it now. I think we could try to improve the wording as suggested as it might be misleading for users who might potentially have issues with interoperability when using large formats from other systems/languages if those don't fit the current limitations.

adamreeve · 2024-07-18T00:38:00Z

OK thanks, I've updated the status documentation so it isn't so misleading now

CurtHagenlocher · 2024-07-18T03:24:57Z

csharp/src/Apache.Arrow/Arrays/LargeBinaryArray.cs

+
+    public ReadOnlySpan<long> ValueOffsets => ValueOffsetsBuffer.Span.CastTo<long>().Slice(Offset, Length + 1);
+
+    public ReadOnlySpan<byte> Values => ValueBuffer.Span.CastTo<byte>();


I have to admit that I like the consistency of having the same members on both classes, but I also wonder at the value (ha ha) of exposing this at all. Someone who needs to get at the underlying buffer can already access ValueBuffer, and this span doesn't have any clear uses.

CurtHagenlocher

Thanks!

conbench-apache-arrow · 2024-07-19T04:16:33Z

After merging your PR, Conbench analyzed the 0 benchmarking runs that have been run so far on merge-commit 299ad70.

None of the specified runs were found on the Conbench server.

The full Conbench report has more details.

adamreeve added 7 commits July 16, 2024 16:06

Add LargeString, LargeBinary and LargeList array types for .NET

1280dcc

Add test data generation for LargeString, LargeBinary and LargeList a…

d170f5a

…rrays

Support round tripping LargeString, LargeBinary and LargeList via IPC

abb1d84

Add more helpful exception when IPC message is too large

269871a

Support round tripping large array types via C Data Interface

527a3d5

Better error messages when trying to import arrays that are too large

e0ea4a8

Enable integration tests for new array types

33ce3b2

adamreeve requested review from assignUser, kou, raulcd and CurtHagenlocher as code owners July 16, 2024 04:52

github-actions bot added Component: C# Component: Documentation awaiting review Awaiting review labels Jul 16, 2024

adamreeve commented Jul 16, 2024

View reviewed changes

github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jul 16, 2024

adamreeve commented Jul 16, 2024

View reviewed changes

Update support status documentation

99ba9a4

adamreeve force-pushed the large_string branch from 0eaa57f to 99ba9a4 Compare July 16, 2024 05:23

raulcd reviewed Jul 16, 2024

View reviewed changes

github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Jul 16, 2024

Clarify support status of large array types

ac39793

github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jul 18, 2024

CurtHagenlocher approved these changes Jul 18, 2024

View reviewed changes

Remove unnecessary LargeBinaryArray.Values property

2990b8b

CurtHagenlocher approved these changes Jul 18, 2024

View reviewed changes

CurtHagenlocher merged commit 299ad70 into apache:main Jul 19, 2024
12 checks passed

CurtHagenlocher removed the awaiting change review Awaiting change review label Jul 19, 2024

CurtHagenlocher mentioned this pull request Jul 19, 2024

[C#] Add LargeBinary, LargeString and LargeList array types #43266

Closed

adamreeve deleted the large_string branch July 19, 2024 03:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GH-43266: [C#] Add LargeBinary, LargeString and LargeList array types #43269

GH-43266: [C#] Add LargeBinary, LargeString and LargeList array types #43269

adamreeve commented Jul 16, 2024 •

edited

Loading

adamreeve Jul 16, 2024

CurtHagenlocher Jul 18, 2024

adamreeve Jul 18, 2024

adamreeve Jul 16, 2024 •

edited

Loading

raulcd left a comment •

edited

Loading

CurtHagenlocher commented Jul 16, 2024

adamreeve commented Jul 16, 2024

raulcd commented Jul 17, 2024

adamreeve commented Jul 18, 2024

CurtHagenlocher Jul 18, 2024

CurtHagenlocher left a comment

conbench-apache-arrow bot commented Jul 19, 2024


		public ReadOnlySpan<long> ValueOffsets => ValueOffsetsBuffer.Span.CastTo<long>().Slice(Offset, Length + 1);

		public ReadOnlySpan<byte> Values => ValueBuffer.Span.CastTo<byte>();

GH-43266: [C#] Add LargeBinary, LargeString and LargeList array types #43269

GH-43266: [C#] Add LargeBinary, LargeString and LargeList array types #43269

Conversation

adamreeve commented Jul 16, 2024 • edited Loading

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Implementation notes

adamreeve Jul 16, 2024

Choose a reason for hiding this comment

CurtHagenlocher Jul 18, 2024

Choose a reason for hiding this comment

adamreeve Jul 18, 2024

Choose a reason for hiding this comment

adamreeve Jul 16, 2024 • edited Loading

Choose a reason for hiding this comment

raulcd left a comment • edited Loading

Choose a reason for hiding this comment

CurtHagenlocher commented Jul 16, 2024

adamreeve commented Jul 16, 2024

raulcd commented Jul 17, 2024

adamreeve commented Jul 18, 2024

CurtHagenlocher Jul 18, 2024

Choose a reason for hiding this comment

CurtHagenlocher left a comment

Choose a reason for hiding this comment

conbench-apache-arrow bot commented Jul 19, 2024

adamreeve commented Jul 16, 2024 •

edited

Loading

adamreeve Jul 16, 2024 •

edited

Loading

raulcd left a comment •

edited

Loading