Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-43266: [C#] Add LargeBinary, LargeString and LargeList array types #43269

Merged
merged 10 commits into from
Jul 19, 2024

Conversation

adamreeve
Copy link
Contributor

@adamreeve adamreeve commented Jul 16, 2024

Rationale for this change

See #43266. Note that LargeBinary and LargeString are still limited to 2 GiB buffers, and LargeList is limited to offsets that can be represented as int32.

What changes are included in this PR?

  • Add new Array subtypes: LargeBinaryArray, LargeStringArray and LargeListArray
  • Support round-tripping these array types via the IPC format
  • Support round-tripping these array types via the C Data Interface
  • Improve error messages when importing arrays that are too large via IPC or C Data Interface
  • Enable integration tests for the new types
  • Update documentation

Are these changes tested?

Yes, I've added some basic tests specifically for the new array types, and added these to the test data generator so they're covered by the existing tests for round tripping using IPC and C Data Interface.

Are there any user-facing changes?

Yes, this is a new user facing feature.

Implementation notes

  • I haven't added builders for these new array types. Given they're added to help with interoperability with other libraries, I wouldn't expect .NET users to build arrays of these types as they provide no other benefit over the non-large types until we have proper large memory support. But I'm happy to add this if it would be useful.
  • The new array types share a lot of logic with the non-large types. I considered trying to consolidate this logic by adding a new BinaryArrayBase<TOffset> class for example, but I think this would require generic math support to work nicely, and would still complicate the code quite a bit and add extra virtual method call overhead. So I think it's fine to keep these new Array subtypes independent from the non-large types.
  • I haven't included support for materializing a LargeStringArray (see GH-41047: [C#] Address performance issue of reading from StringArray #41048). I'm not sure whether there would be a use for this, but it could be added later if needed.
  • GitHub Issue: [C#] Add LargeBinary, LargeString and LargeList array types #43266


public ReadOnlySpan<long> ValueOffsets => ValueOffsetsBuffer.Span.CastTo<long>().Slice(Offset, Length + 1);

public ReadOnlySpan<byte> Values => ValueBuffer.Span.CastTo<byte>();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if we want to consider naming this something different like SmallValues to help with backwards compatibility when we eventually have something like a LargeReadOnlySpan<T> type.

This problem isn't specific to these new array types though. For PrimitiveArray for example we'll probably also want to introduce a new "Large" version of the Values ReadOnlySpan, so for consistency I think it's fine to keep calling this Values. Then we can later add something named like LargeValues to all applicable array types.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to admit that I like the consistency of having the same members on both classes, but I also wonder at the value (ha ha) of exposing this at all. Someone who needs to get at the underlying buffer can already access ValueBuffer, and this span doesn't have any clear uses.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, not adding this member solves the backwards compatibility problem nicely, and I don't see a great need for it either. I'll remove this.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jul 16, 2024
@@ -132,7 +132,13 @@ protected ReadResult ReadMessage()

Flatbuf.Message message = Flatbuf.Message.GetRootAsMessage(CreateByteBuffer(messageBuff));

int bodyLength = checked((int)message.BodyLength);
if (message.BodyLength > int.MaxValue)
Copy link
Contributor Author

@adamreeve adamreeve Jul 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there's a need for more specific checks when importing individual arrays as this should cover all scenarios.

There's also an ArrowMemoryReaderImplementation but that needs to be constructed with a ReadOnlyMemory<byte> so is already limited to 2 GiB.

I tested reading data from Flight which uses a RecordBatchReaderImplementation, but that actually fails on the C++ server side with an error that > 2 GiB record batches are not supported (see FlightPayload::Validate), and I'm not sure whether any other languages do support > 2 GiB batches over Flight.

Copy link
Member

@raulcd raulcd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hey @adamreeve ! this is great but Isn't the main point of using Large List and Large Binary / Large String to be able to have offsets represented with int64 and/or data buffers > 2GiB?
This will allow some compatibility with other languages but on the cases supported we should probably not use Large Binary/String/List and just use "normal" Binary/String/List.

I am missing to see the rationale of the change without supporting the main "feature" of these Large formats but probably I am just missing some context :)

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Jul 16, 2024
@CurtHagenlocher
Copy link
Contributor

@raulcd The primary motivation is integration with Polars, which apparently doesn't support the non-Large versions.

@adamreeve
Copy link
Contributor Author

Yeah as Curt says this is about compatibility with other libraries, but otherwise these don't currently provide any other benefit if you're only using .NET, and you'll get errors if you try to import anything that's too large.

I'm happy to revert the documentation changes if you think they're misleading? Or maybe update the wording to add that this is only to help with interoperability?

@raulcd
Copy link
Member

raulcd commented Jul 17, 2024

Or maybe update the wording to add that this is only to help with interoperability?

Thanks, that makes sense with the context. I see the value of supporting it now. I think we could try to improve the wording as suggested as it might be misleading for users who might potentially have issues with interoperability when using large formats from other systems/languages if those don't fit the current limitations.

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jul 18, 2024
@adamreeve
Copy link
Contributor Author

OK thanks, I've updated the status documentation so it isn't so misleading now


public ReadOnlySpan<long> ValueOffsets => ValueOffsetsBuffer.Span.CastTo<long>().Slice(Offset, Length + 1);

public ReadOnlySpan<byte> Values => ValueBuffer.Span.CastTo<byte>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to admit that I like the consistency of having the same members on both classes, but I also wonder at the value (ha ha) of exposing this at all. Someone who needs to get at the underlying buffer can already access ValueBuffer, and this span doesn't have any clear uses.

Copy link
Contributor

@CurtHagenlocher CurtHagenlocher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@CurtHagenlocher CurtHagenlocher merged commit 299ad70 into apache:main Jul 19, 2024
12 checks passed
@CurtHagenlocher CurtHagenlocher removed the awaiting change review Awaiting change review label Jul 19, 2024
@adamreeve adamreeve deleted the large_string branch July 19, 2024 03:55
Copy link

After merging your PR, Conbench analyzed the 0 benchmarking runs that have been run so far on merge-commit 299ad70.

None of the specified runs were found on the Conbench server.

The full Conbench report has more details.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants