GH-32240: [C#] Add new Apache.Arrow.Compression package to implement IPC decompression #33893

Merged on Feb 28, 2023 (11 commits)

Contributor

@adamreeve adamreeve commented Jan 26, 2023

Rationale for this change

This further addresses #32240 and follows up on PR #33603 by providing an implementation of the ICompressionCodecFactory interface in a new Apache.Arrow.Compression package. Because this is a separate package, users who don't need IPC decompression support don't have to pull in the extra dependencies.

What changes are included in this PR?

Adds a new Apache.Arrow.Compression package and moves the existing compression codec implementations used for testing into this package.

Are these changes tested?

There are unit tests verifying the decompression support, but this also affects the release scripts and I'm not sure how to fully test these.

Are there any user-facing changes?

Yes, this adds a new package users can install for IPC decompression support, so documentation has been updated.
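
As a rough illustration of the user-facing change, reading an IPC file with compressed buffers might look like the sketch below. It assumes the new package exposes a `CompressionCodecFactory` class implementing ICompressionCodecFactory and that `ArrowFileReader` accepts it as a second constructor argument; `data.arrow` is a placeholder path, and the names should be checked against the released package.

```csharp
using System;
using System.IO;
using Apache.Arrow;
using Apache.Arrow.Compression; // the new package added by this PR
using Apache.Arrow.Ipc;

// "data.arrow" is a placeholder for an IPC file written with compressed buffers.
using var stream = File.OpenRead("data.arrow");

// Assumption: passing a codec factory is what enables buffer decompression.
var codecFactory = new CompressionCodecFactory();
using var reader = new ArrowFileReader(stream, codecFactory);

// Read batches until the reader signals the end of the file with null.
RecordBatch batch;
while ((batch = await reader.ReadNextRecordBatchAsync()) != null)
{
    using (batch)
    {
        Console.WriteLine($"Read a batch with {batch.Length} rows");
    }
}
```

Without the factory argument the reader has no codec available, so projects that never read compressed data can skip the extra package entirely.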

</PropertyGroup>

<ItemGroup>
<PackageReference Include="CommunityToolkit.HighPerformance" Version="8.0.0" />
Contributor Author

@adamreeve adamreeve Jan 26, 2023

The CommunityToolkit.HighPerformance dependency is only needed for the AsStream extension methods on Memory<byte> and ReadOnlyMemory<byte>, which could be re-implemented if we wanted to minimize extra dependencies.

Member

I think this is fine.

Contributor

I found that the LZ4 library offers APIs that take Spans. We should use those APIs instead, which would let us get rid of this dependency.

https://github.com/MiloszKrajewski/K4os.Compression.LZ4/blob/fa8b8e038b500d565efe12769db097852a28ddf7/src/K4os.Compression.LZ4/LZ4Codec.cs#L139-L144

Note that there are two LZ4 libraries: one for streams (which this PR currently uses) and one that doesn't use streams: https://www.nuget.org/packages/K4os.Compression.LZ4

Contributor Author

The two libraries actually implement different formats. The K4os.Compression.LZ4.Streams library implements the LZ4 frame format, and the K4os.Compression.LZ4 library implements the LZ4 block format. Arrow IPC uses the frame format (https://github.com/apache/arrow/blob/apache-arrow-11.0.0/format/Message.fbs#L46-L48), so we need to use the streams library.
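
To make the distinction concrete: an LZ4 frame begins with the magic number 0x184D2204 (stored little-endian), while a raw LZ4 block carries no header at all. The helper below is a hypothetical sketch, not part of Arrow or the K4os libraries, that checks whether a buffer at least starts like a frame.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical helper for illustration only; not an Arrow or K4os API.
static class Lz4FormatSniffer
{
    // Magic number that opens every LZ4 frame, little-endian on the wire.
    private const uint Lz4FrameMagic = 0x184D2204;

    // Returns true when the buffer starts with the LZ4 frame magic number.
    // A raw LZ4 block has no header, so this cannot positively identify one.
    public static bool LooksLikeLz4Frame(ReadOnlySpan<byte> data) =>
        data.Length >= 4 &&
        BinaryPrimitives.ReadUInt32LittleEndian(data) == Lz4FrameMagic;
}
```

A check like this only inspects the first four bytes; it does not validate the rest of the frame.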

Contributor

I've opened MiloszKrajewski/K4os.Compression.LZ4#79 to request that this be added to the K4os.Compression.LZ4 library.

Contributor Author

Thanks Eric. Your issue prompted me to dig further into the K4os library, and I realised that since I started working on this, a new API has been added that allows using the frame format with more types than just Stream. I've switched to this API and removed the CommunityToolkit.HighPerformance dependency.

@assignUser
Member

@github-actions crossbow submit nuget verify-rc-source-csharp*

@assignUser
Member

@adamreeve I have kicked off some additional CI jobs that should be sufficient to test the changes to the CI side of things and the building of the NuGet packages. The release script is tested with ci/scripts/release_test.sh, but the tests themselves are implemented in Ruby.

@kou can you say whether additional tests are needed for this PR and, if so, how to implement them?

@github-actions

Revision: 72f1c16

Submitted crossbow builds: ursacomputing/crossbow @ actions-c460b32b4b

Task Status
nuget Github Actions
verify-rc-source-csharp-linux-almalinux-8-amd64 Github Actions
verify-rc-source-csharp-linux-conda-latest-amd64 Github Actions
verify-rc-source-csharp-linux-ubuntu-18.04-amd64 Github Actions
verify-rc-source-csharp-linux-ubuntu-20.04-amd64 Github Actions
verify-rc-source-csharp-linux-ubuntu-22.04-amd64 Github Actions
verify-rc-source-csharp-macos-amd64 Github Actions
verify-rc-source-csharp-macos-arm64 Github Actions

Member

@assignUser assignUser left a comment

The CI side looks good; I can't comment on the C#. +1

@kou
Member

kou commented Jan 27, 2023

@github-actions crossbow submit nuget

@kou kou requested a review from eerhardt January 27, 2023 02:31
@kou
Member

kou commented Jan 27, 2023

@eerhardt Could you review this?

@kou
Member

kou commented Jan 27, 2023

We don't have a test for post-06-csharp.sh and don't need one, because it just pushes built packages to NuGet.

@github-actions

Revision: 6b5145e

Submitted crossbow builds: ursacomputing/crossbow @ actions-2b874cf487

Task Status
nuget Github Actions

Contributor

@eerhardt eerhardt left a comment

Just a couple last nits. I think this can be merged after they are addressed.

Thanks for the great work here, @adamreeve!

Contributor

@eerhardt eerhardt left a comment

LGTM. Thanks again!

I'll merge this soon, assuming no more feedback is submitted.

@eerhardt eerhardt merged commit 6776229 into apache:main Feb 28, 2023
@ursabot

ursabot commented Mar 2, 2023

Benchmark runs are scheduled for baseline = cb63068 and contender = 6776229. 6776229 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.15% ⬆️0.0%] test-mac-arm
[Finished ⬇️0.26% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.44% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 67762295 ec2-t3-xlarge-us-east-2
[Finished] 67762295 test-mac-arm
[Finished] 67762295 ursa-i9-9960x
[Finished] 67762295 ursa-thinkcentre-m75q
[Finished] cb630686 ec2-t3-xlarge-us-east-2
[Failed] cb630686 test-mac-arm
[Finished] cb630686 ursa-i9-9960x
[Finished] cb630686 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@adamreeve adamreeve deleted the dotnet_compression_impl branch January 29, 2024 20:32
Successfully merging this pull request may close these issues.

[C#] Add decompression support for Record Batches
6 participants