[C#] Add decompression support for Record Batches #32240

asfimport · 2022-06-28T12:41:36Z

C# Implementation does not support reading batches written in other implementations of Arrow when the compression is specified in IPC Write options.

e.g. Reading this batch from pyarrow in C# will fail:

pyarrow.ipc.RecordStreamBatchWriter(sink, schema, options=pyarrow,ipcWriteOptions(compression="lz4"))

This is to support decompression (lz4 & zstd) in the C# implementation.

Reporter: Rishabh Rana / @rishabh-rana

Related issues:

[C#] Support compression in IPC format (is duplicated by)

_{Note: This issue was originally created as ARROW-16921. Please see the migration documentation for further details.}

asfimport · 2022-06-28T12:47:44Z

Rishabh Rana / @rishabh-rana:
@eerhardt @HashidaTKS I have a small query for this change. I would be grateful if you can answer/ point me to the correct person to help me with this:

Are we allowed to add a dependency to Arrow for an lz4 compression/ decompression library as part of this change? If not can you suggest another way forward?

asfimport · 2022-06-28T15:24:12Z

Eric Erhardt / @eerhardt:
One option would be to add an extension point into the reader/writer where a compression library could be "plugged in". That way the core Arrow library doesn't need to depend on all the different compression libraries, especially when they might not be used.

asfimport · 2022-07-06T08:11:13Z

Rishabh Rana / @rishabh-rana:
@eerhardt I looked at the possible options and the least intrusive one I could find is exposing an interface to the user for the library. They must write a "Decode" class based on that interface and pass that to the reader. This way users can use any library they want to decode the body.

example:

interface Decoder {
  public void Lz4Decode(Stream input, Stream output);
  public void ZstdDecode(Stream input, Stream output);
}

// usage
public class LZ4Decoder: Decoder {
  public void Lz4Decode(Stream input, Stream output) {
    LZ4Stream.Decode(input).CopyTo(output);
  }
 public void Lz4Decode(Stream input, Stream output) {
    throw new NotImplementedException();
  }
}

ArrowStreamReader(stream, new LZ4Decoder());

Does this sound good to you?

Apologies for confirming again but this would impact quite a bit of code, So I wanted to make sure that I follow the best practices. Thanks!

asfimport · 2022-07-21T14:36:44Z

Rishabh Rana / @rishabh-rana:
@eerhardt pinging you for the query above. Do you think this approach looks good?

asfimport · 2022-08-25T00:23:44Z

Weston Pace / @westonpace:
@rishabh-rana I think this approach could be slightly improved. What you have described would require the reader to know ahead of time what compression method was used on the IPC file. Could you perhaps do something like...


interface Decoder {
  public void Decode(Stream input, Stream output);
}
interface CompressionProvider {
  public Decoder GetDecoder(Apache.Arrow.Flatbuf.CompressionType compressionType);
}
public class LZ4Decoder : Decoder {
  public void Lz4Decode(Stream input, Stream output) {
    LZ4Stream.Decode(input).CopyTo(output);
  }
}
public class LZ4OnlyCompressionProvider : CompressionProvider {

  private readonly LZ4Decoder _lz4Decoder = new LZ4Decoder();

  public Decoder GetDecoder(Apache.Arrow.Flatbuf.CompressionType compressionType) {
    if (compressionType == CompressionType.LZ4_FRAME) {
      return _lz4Decoder;
    } else {
      throw new NotImplementedException();
    }
  }
}

ArrowStreamReader(stream, new LZ4OnlyCompressionProvider());

This opens the door for a future where someone could write a compression provider that supported both LZ4 and gzip (I think that gzip is part of System.IO.Compression and so should be very easy to support).

asfimport · 2022-08-25T00:33:47Z

Weston Pace / @westonpace:
Also, it might be nice if there were a standard compression provider which auto-detected the presence of common compression packages and added support automatically if the package was installed. That way the user wouldn't have to do anything except make sure they had the appropriate compression package installed.

For example, you could use Type.GetType to check if System.IO.Compression.GZipStream exists and, if it does, add a GZipDecoder automatically. However, this would require reflection to create the GZipStream. Maybe someone can think of a more clever solution.

asfimport · 2022-09-07T22:04:14Z

Adam Reeve / @adamreeve:
Hi, we're interested in this feature at G-Research and believe compression support is important for use at scale. We're keen to help out where we can. I agree that it would be nice if there was a way for compression support to be added more automatically, without users needing to implement decoders themselves. An alternative approach that builds on the Type.GetType idea would be to provide wrapper packages for each compression format used in the IPC format (currently only Zstd and LZ4 I believe), and these could provide implementations of an IDecoder interface defined in the main dotnet Arrow library. So instead of getting the LZ4Stream type with Type.GetType for example, we could do something like this to work with the IDecoder interface without needing to use reflection:

var lz4DecoderType = Type.GetType("Apache.Arrow.Compression.Lz4.Lz4Decoder, Apache.Arrow.Compression.Lz4", false);
if (lz4DecoderType != null)
{
    if (Activator.CreateInstance(lz4DecoderType) is IDecoder decoder)
    {
        // use decoder
    }
    else
    {
        throw new Exception("Failed to cast Lz4Decoder to IDecoder");
    }
}

Having to maintain these extra wrapper packages would be a bit more work from an operational point of view though. It does seem like just adding new dependencies directly to the Arrow package would be a lot more straightforward, and given there are only two compression formats currently used is this really a problem?

On a more minor point, would Decompressor be a more precise term to use rather than Decoder? At least in the Parquet world, which I'm a bit more familiar with, encodings are a separate concept to compression formats.

asfimport · 2022-09-08T02:57:55Z

Rishabh Rana / @rishabh-rana:
Thanks for your responses @adamreeve and @westonpace.

Personally, I am more inclined towards the approach that Eric and Weston have proposed. To avoid any unnecessary dependencies on Arrow. I will try to put the code together for the issue soon for a more fruitful discussion on the PR itself.

Thanks!

asfimport · 2022-09-15T04:40:35Z

Adam Reeve / @adamreeve:
Hi @rishabh-rana, I wanted to get some proof-of-concept code working using decompression so I've implemented basic Zstd decompression support in my Arrow fork: master...adamreeve:arrow:dotnet_decompression

It isn't in a state suitable for a pull request and just adds ZstdNet as a dependency of Apache.Arrow, which required bumping the target framework to netstandard2.0, but it might still be useful to you as a starting point, depending on how far you've already got with this.

asfimport · 2022-12-14T17:52:06Z

Apache Arrow JIRA Bot:
This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned per project policy. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon.

asfimport · 2022-12-14T22:01:41Z

Adam Reeve / @adamreeve:
Hi @rishabh-rana, are you still interested in working on this? If not I could look at adapting my code to work with the API discussed here.

adamreeve · 2023-01-11T02:59:06Z

I went ahead and tidied up my code and used the reflection approach suggested by @westonpace to implement a default compression provider. I've created a PR for this at #33603

I'm still not sure the reflection approach is the best way to go, it's pretty verbose and not very discoverable by users. I've included some comments in that PR about alternative approaches but maybe it's better to discuss them here rather than on the PR, so I'll copy them below:

An alternative approach that could be considered instead of reflection is to use these extra dependencies as build time dependencies but not make them dependencies of the NuGet package. I tested this out in adamreeve@4544afd and it seems to work reasonably well too but required bumping the version of System.Runtime.CompilerServices.Unsafe under the netstandard2.0 and netcoreapp3.1 targets. This reduces all the reflection boilerplate but seems pretty hacky as now Apache.Arrow.dll depends on these extra dlls and we rely on the dotnet runtime behaviour of not trying to load them until they're used. So I think the reflection approach is better.

Another alternative would be to move decompression support into a separate NuGet package (eg. Apache.Arrow.Compression) that depends on Apache.Arrow and has an implementation of ICompressionProvider that users can pass in to the ArrowFileReader constructor, or maybe has a way to register itself with the Apache.Arrow package so it only needs to be configured once. That would seem cleaner to me but I'm not sure how much work it would be to set up a whole new package.

PS. ~~It wasn't very obvious that this issue had been migrated to GitHub, maybe comments should be added to open Jira issues to explain that comments are disabled and point to the migrated issue on GitHub?~~ Looks like I just had bad timing, there's a comment there now

westonpace · 2023-01-11T17:10:06Z

Another alternative would be to move decompression support into a separate NuGet package (eg. Apache.Arrow.Compression) that depends on Apache.Arrow and has an implementation of ICompressionProvider that users can pass in to the ArrowFileReader constructor, or maybe has a way to register itself with the Apache.Arrow package so it only needs to be configured once. That would seem cleaner to me but I'm not sure how much work it would be to set up a whole new package.

This sounds like a good approach to me. I think having a standard package that just picks some common compression libs is going to be sufficient for 80% of use cases.

adamreeve · 2023-01-11T20:08:00Z

Cool okay, @lidavidm mentioned on the PR that the Java library works that way too so I'll take a look at the API used there to see if I can make the dotnet API consistent with the way the Java API works, and pare the PR back to remove the reflection based implementation for now. It might make sense to add the new package in a separate PR.

westonpace · 2023-01-11T23:13:40Z

It might make sense to add the new package in a separate PR.

That seems reasonable to me.

# Which issue does this PR close? Closes #32240 / https://issues.apache.org/jira/browse/ARROW-16921 # What changes are included in this PR? This PR implements decompression support for Arrow IPC format files and streams in the dotnet/C# library. The main concern raised in the above Jira issue was that we don't want to add new NuGet package dependencies to support decompression formats that won't be needed by most users, so a default `CompressionProvider` implementation has been added that uses reflection to use the `ZstdNet` package for ZSTD decompression and `K4os.Compression.LZ4.Streams` and `CommunityToolkit.HighPerformance` for LZ4 Frame support if they are available. The `netstandard1.3` target has decompression support disabled due to some reflection functionality being missing, and neither `ZstdNet` or `K4os.Compression.LZ4.Streams` support `netstandard1.3`. The `ArrowFileReader` and `ArrowStreamReader` constructors accept an `ICompressionProvider` parameter to allow users to provide their own compression provider if they want to use different dependencies. ### Alternatives to consider An alternative approach that could be considered instead of reflection is to use these extra dependencies as build time dependencies but not make them dependencies of the NuGet package. I tested this out in adamreeve@4544afd and it seems to work reasonably well too but required bumping the version of `System.Runtime.CompilerServices.Unsafe` under the `netstandard2.0` and `netcoreapp3.1` targets. This reduces all the reflection boilerplate but seems pretty hacky as now Apache.Arrow.dll depends on these extra dlls and we rely on the dotnet runtime behaviour of not trying to load them until they're used. So I think the reflection approach is better. Another alternative would be to move decompression support into a separate NuGet package (eg. `Apache.Arrow.Compression`) that depends on `Apache.Arrow` and has an implementation of `ICompressionProvider` that users can pass in to the `ArrowFileReader` constructor, or maybe has a way to register itself with the `Apache.Arrow` package so it only needs to be configured once. That would seem cleaner to me but I'm not sure how much work it would be to set up a whole new package. # Are these changes tested? Yes, new unit tests have been added. Test files have been created with a Python script that is included in the PR due to only decompression support being added and not compression support. # Are there any user-facing changes? Yes, this implements a new feature but in a backwards compatible way. * Closes: #32240 Authored-by: Adam Reeve <[email protected]> Signed-off-by: Will Jones <[email protected]>

…che#33603) # Which issue does this PR close? Closes apache#32240 / https://issues.apache.org/jira/browse/ARROW-16921 # What changes are included in this PR? This PR implements decompression support for Arrow IPC format files and streams in the dotnet/C# library. The main concern raised in the above Jira issue was that we don't want to add new NuGet package dependencies to support decompression formats that won't be needed by most users, so a default `CompressionProvider` implementation has been added that uses reflection to use the `ZstdNet` package for ZSTD decompression and `K4os.Compression.LZ4.Streams` and `CommunityToolkit.HighPerformance` for LZ4 Frame support if they are available. The `netstandard1.3` target has decompression support disabled due to some reflection functionality being missing, and neither `ZstdNet` or `K4os.Compression.LZ4.Streams` support `netstandard1.3`. The `ArrowFileReader` and `ArrowStreamReader` constructors accept an `ICompressionProvider` parameter to allow users to provide their own compression provider if they want to use different dependencies. ### Alternatives to consider An alternative approach that could be considered instead of reflection is to use these extra dependencies as build time dependencies but not make them dependencies of the NuGet package. I tested this out in adamreeve@4544afd and it seems to work reasonably well too but required bumping the version of `System.Runtime.CompilerServices.Unsafe` under the `netstandard2.0` and `netcoreapp3.1` targets. This reduces all the reflection boilerplate but seems pretty hacky as now Apache.Arrow.dll depends on these extra dlls and we rely on the dotnet runtime behaviour of not trying to load them until they're used. So I think the reflection approach is better. Another alternative would be to move decompression support into a separate NuGet package (eg. `Apache.Arrow.Compression`) that depends on `Apache.Arrow` and has an implementation of `ICompressionProvider` that users can pass in to the `ArrowFileReader` constructor, or maybe has a way to register itself with the `Apache.Arrow` package so it only needs to be configured once. That would seem cleaner to me but I'm not sure how much work it would be to set up a whole new package. # Are these changes tested? Yes, new unit tests have been added. Test files have been created with a Python script that is included in the PR due to only decompression support being added and not compression support. # Are there any user-facing changes? Yes, this implements a new feature but in a backwards compatible way. * Closes: apache#32240 Authored-by: Adam Reeve <[email protected]> Signed-off-by: Will Jones <[email protected]>

…ReadOnlyMemory (#34108) This is a small follow-up to #33603 to support reading a compressed IPC stream from a `ReadOnlyMemory<byte>`, as I missed that this has a separate reader implementation. * Closes: #32240 Authored-by: Adam Reeve <[email protected]> Signed-off-by: Weston Pace <[email protected]>

…che#33603) # Which issue does this PR close? Closes apache#32240 / https://issues.apache.org/jira/browse/ARROW-16921 # What changes are included in this PR? This PR implements decompression support for Arrow IPC format files and streams in the dotnet/C# library. The main concern raised in the above Jira issue was that we don't want to add new NuGet package dependencies to support decompression formats that won't be needed by most users, so a default `CompressionProvider` implementation has been added that uses reflection to use the `ZstdNet` package for ZSTD decompression and `K4os.Compression.LZ4.Streams` and `CommunityToolkit.HighPerformance` for LZ4 Frame support if they are available. The `netstandard1.3` target has decompression support disabled due to some reflection functionality being missing, and neither `ZstdNet` or `K4os.Compression.LZ4.Streams` support `netstandard1.3`. The `ArrowFileReader` and `ArrowStreamReader` constructors accept an `ICompressionProvider` parameter to allow users to provide their own compression provider if they want to use different dependencies. ### Alternatives to consider An alternative approach that could be considered instead of reflection is to use these extra dependencies as build time dependencies but not make them dependencies of the NuGet package. I tested this out in adamreeve@4544afd and it seems to work reasonably well too but required bumping the version of `System.Runtime.CompilerServices.Unsafe` under the `netstandard2.0` and `netcoreapp3.1` targets. This reduces all the reflection boilerplate but seems pretty hacky as now Apache.Arrow.dll depends on these extra dlls and we rely on the dotnet runtime behaviour of not trying to load them until they're used. So I think the reflection approach is better. Another alternative would be to move decompression support into a separate NuGet package (eg. `Apache.Arrow.Compression`) that depends on `Apache.Arrow` and has an implementation of `ICompressionProvider` that users can pass in to the `ArrowFileReader` constructor, or maybe has a way to register itself with the `Apache.Arrow` package so it only needs to be configured once. That would seem cleaner to me but I'm not sure how much work it would be to set up a whole new package. # Are these changes tested? Yes, new unit tests have been added. Test files have been created with a Python script that is included in the PR due to only decompression support being added and not compression support. # Are there any user-facing changes? Yes, this implements a new feature but in a backwards compatible way. * Closes: apache#32240 Authored-by: Adam Reeve <[email protected]> Signed-off-by: Will Jones <[email protected]>

… from ReadOnlyMemory (apache#34108) This is a small follow-up to apache#33603 to support reading a compressed IPC stream from a `ReadOnlyMemory<byte>`, as I missed that this has a separate reader implementation. * Closes: apache#32240 Authored-by: Adam Reeve <[email protected]> Signed-off-by: Weston Pace <[email protected]>

…IPC decompression (#33893) ### Rationale for this change This further addresses #32240 and is a follow up to PR #33603 to provide an implementation of the `ICompressionCodecFactory` interface in a new `Apache.Arrow.Compression` package. Making this a separate package means users who don't need IPC decompression support don't need to pull in extra dependencies. ### What changes are included in this PR? Adds a new `Apache.Arrow.Compression` package and moves the existing compression codec implementations used for testing into this package. ### Are these changes tested? There are unit tests verifying the decompression support, but this also affects the release scripts and I'm not sure how to fully test these. ### Are there any user-facing changes? Yes, this adds a new package users can install for IPC decompression support, so documentation has been updated. * Closes: #32240 Lead-authored-by: Adam Reeve <[email protected]> Co-authored-by: Eric Erhardt <[email protected]> Signed-off-by: Eric Erhardt <[email protected]>

adamreeve mentioned this issue Jan 11, 2023

GH-32240: [C#] Support decompression of IPC format buffers #33603

Merged

asfimport mentioned this issue Jan 11, 2023

[C#] Support compression in IPC format #32370

Closed

github-actions bot assigned adamreeve Jan 11, 2023

wjones127 closed this as completed in #33603 Jan 26, 2023

wjones127 added this to the 12.0.0 milestone Jan 26, 2023

adamreeve mentioned this issue Jan 26, 2023

GH-32240: [C#] Add new Apache.Arrow.Compression package to implement IPC decompression #33893

Merged

github-actions bot mentioned this issue Feb 9, 2023

GH-32240: [C#] Support decompression when reading an IPC stream from ReadOnlyMemory #34108

Merged

adamreeve mentioned this issue Jan 30, 2024

[C#] Create implementation of ARROW-300 / IPC record batch body buffer compression #24834

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[C#] Add decompression support for Record Batches #32240

[C#] Add decompression support for Record Batches #32240

asfimport commented Jun 28, 2022 •

edited

Loading

asfimport commented Jun 28, 2022

asfimport commented Jun 28, 2022

asfimport commented Jul 6, 2022

asfimport commented Jul 21, 2022

asfimport commented Aug 25, 2022

asfimport commented Aug 25, 2022

asfimport commented Sep 7, 2022

asfimport commented Sep 8, 2022

asfimport commented Sep 15, 2022

asfimport commented Dec 14, 2022

asfimport commented Dec 14, 2022

adamreeve commented Jan 11, 2023 •

edited

Loading

westonpace commented Jan 11, 2023

adamreeve commented Jan 11, 2023

westonpace commented Jan 11, 2023

[C#] Add decompression support for Record Batches #32240

[C#] Add decompression support for Record Batches #32240

Comments

asfimport commented Jun 28, 2022 • edited Loading

Related issues:

asfimport commented Jun 28, 2022

asfimport commented Jun 28, 2022

asfimport commented Jul 6, 2022

asfimport commented Jul 21, 2022

asfimport commented Aug 25, 2022

asfimport commented Aug 25, 2022

asfimport commented Sep 7, 2022

asfimport commented Sep 8, 2022

asfimport commented Sep 15, 2022

asfimport commented Dec 14, 2022

asfimport commented Dec 14, 2022

adamreeve commented Jan 11, 2023 • edited Loading

westonpace commented Jan 11, 2023

adamreeve commented Jan 11, 2023

westonpace commented Jan 11, 2023

asfimport commented Jun 28, 2022 •

edited

Loading

adamreeve commented Jan 11, 2023 •

edited

Loading