Skip to content
This repository has been archived by the owner on Jan 17, 2024. It is now read-only.

Is it possible to get pointer of Uint8List instead of allocating then copy #31

Closed
hanabi1224 opened this issue Dec 16, 2019 · 16 comments
Closed

Comments

@hanabi1224
Copy link

hanabi1224 commented Dec 16, 2019

Echo #27
I have a package that binds to native lz4 compression lib via dart ffi, however, with this restriction, I have to duplicate the data before compression or decompression, which I really really want to avoid.

I understand a dart reference is GC controlled and I can play some tricks to hold the reference to avoid unexpected GC kick in, rather than actually duplicating data before decompression, please let me know your concerns. Thanks!

BTW, If you'd like to understand my scenario better, source code is here

@mkustermann
Copy link

I have to duplicate the data before compression or decompression, which I really really want to avoid.

The simplest solution would be to use C memory as the buffer. You can use a Uint8List view on that buffer in Dart. Then there is no need to copy, since you can pass the underlying Pointer<UInt8> to C code (i.e. all your code operates on C memory - Pointer.asTypedList() can be used for filling in the data).

I can play some tricks to hold the reference to avoid unexpected GC kick in ...

With pure usage of dart:ffi this is not possible and doing unsafe things might cause random crashes.
A safe way to do this is to use our existing native extensions. (Where you can use dart_api.h:Dart_TypedDataAcquireData - though in Flutter this is currently not supported)

If you'd like to understand my scenario better, source code is here

You code e.g. contains

    var sourceBuffer = List<int>();
    ...
    await for (final chunk in stream) {
      ...
      sourceBuffer.addAll(chunk);
      ...
      nextSrcSize = _decompressFrame(
          context, dstBuffer, dstSizePtr, srcBuffer, srcSizePtr);
      ...
      sourceBuffer = sourceBuffer.sublist(consumedSrcSize);

Firstly, please note that using List<int> as a representation for bytes is highly inefficient, it's better to use BytesBuidler or (in your case) even better C memory - on 64-bit systems it will use only 1/8th of the memory.

Secondly, instead of using the sourceBuffer = List<int>() you can just use a C-buffer, something like this:

    int sourceBufferPos = 0;
    int bufferSize = 4 * 1024;
    var sourceBuffer = allocate<Uint8>(bufferSize);
    //                            ^^^^^^^^^^^^^^^^^^^^^^^^^^-- make a generous estimate of the chunk sizes we get
    ...
    await for (final chunk in stream) {
      ...
      // <- some code to resize [sourceBuffer] if our estimate above was too small
      sourceBuffer.asTypedList(bufferSize).setRange(sourceBufferPos, <length>, ...);
      ...
      nextSrcSize = _decompressFrame(..., sourceBuffer, ...);

Another comment regarding your code, you can use the [] operator on Pointer, which makes it look much better:

  ... 
  srcSizePtr[0] = sourceBuffer.length;   // <-- Instead of srcSizePtr.asTypedList(1).setAll(0, [sourceBuffer.length]);
  ...
  final consumedSrcSize = srcSizePtr[0];  // <-- Instead of srcSizePtr.elementAt(0).value

Lastly, please notice that the sourceBuffer.addAll(chunk) makes a copy of the chunk data and sourceBuffer.sublist(consumedSrcSize) makes another copy of the data. It's better to avoid this.

@mkustermann
Copy link

As always, if you can reduce your code to a small benchmark we could take a look and see if we can make something faster.

Exposing pointers to Dart VM heap managed objects would need to be thought through very carefully, so it's unlikely to happen very soon.

@hanabi1224
Copy link
Author

@mkustermann Thanks for the hints! Good to know the plan, I will create some FFI functions to allocate native memory directly then.

@hanabi1224
Copy link
Author

@mkustermann Also I cannot find size_t in dart FFI so in my code I use Uint64 which is not accurate either and will cause crash in non x64 build. Any suggestions?

@hanabi1224
Copy link
Author

@mkustermann Regarding the native buffer suggestion, I thought about it but I feel it makes things worse. No matter how we implement this, the compress / decompress function has input buffer from dart heap, and its output buffer will eventually land on dart heap (thus accessible from dart functions), the problem here is the native code algorithm does not have access to dart heap, thus we make a copy to C heap and copy it back (to dart heap), which is definitely not efficient. Given input and output have to be on dart heap, the most efficient way must be to give native code access to it (dart heap), so only if dart vm provides 2 APIs, we can achieve this efficiently. 1. be able to pass Uint8List as Pointer to grant native code access to the raw input. 2. (this is actually half way done), grant native code access to output on dart heap (I assume allocate function in dart:ffi does this), what is missing here is, to let a dart reference (Uint8List) claim the memory of the output buffer with GC taking control (not sure if Uint8List.view lets GC take control of deallocating the memory). Does that make sense?

@dcharkes
Copy link
Contributor

its output buffer will eventually land on dart heap (thus accessible from dart functions)

That is not necessary. You can have the buffer live in C heap, and expose it to Dart with asTypedList().

what is missing here is, to let a dart reference (Uint8List) claim the memory of the output buffer with GC taking control (not sure if Uint8List.view lets GC take control of deallocating the memory).

Finalizers are tracked in dart-lang/sdk#35770.

thus we make a copy to C heap [..], which is definitely not efficient.

As @mkustermann mentioned, please provide a benchmark so that we can assess whether it's the copying that's slow, or whether it's something else that we should optimize.

@mkustermann
Copy link

@hanabi1224 If you follow our advice (**) - how much time is spent in copying vs performing the actual compression/decompression in flutter-release mode?

(**) By having two buffers in C (the input/output buffers for the compression/decompression) and copy into the input buffer before compression/decompression and copy out of the C buffer after compression/decompression -- and perform no other copies.

@hanabi1224
Copy link
Author

hanabi1224 commented Dec 17, 2019

@dcharkes @mkustermann I have refined my code with @mkustermann 's suggestions to replace List with BytesBuilder, but still one-shot decompression is much faster than stream decompression, which uses exactly the same native API, the only difference from what I can tell is, in stream api, it performs multiple copy of source buffer.

I just created a benchmark here Bench result on CI: https://cirrus-ci.com/task/5740408574836736
image

BTW, if you want to run the bench locally, in addition to c compiler, rust is needed as well (Install command: curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y --default-toolchain stable)

@hanabi1224
Copy link
Author

hanabi1224 commented Dec 17, 2019

@mkustermann Regarding your advice ** in one-shot compression / decompression input / output buffer is only allocated once (currently with dart:ffi allocate method, I tried native allocation but I dont see difference), in stream decompression mode, input buffer is streamed so that it's impossible to copy only once, the output buffer is reused so it's only allocated once (with dart:ffi allocate method) tho.

@dcharkes
Copy link
Contributor

@hanabi1224 I have a bit of trouble building the rust lib:

flutter_native_extensions/src/compression/native_compression$ cargo build
   Compiling cc v1.0.48
   Compiling libc v0.2.66
   Compiling dart_native_compression v0.1.0 (/usr/local/google/home/dacoharkes/ffi_samples/hanabi1224/flutter_native_extensions/src/compression/native_compression)
error: failed to run custom build command for `dart_native_compression v0.1.0 (/usr/local/google/home/dacoharkes/ffi_samples/hanabi1224/flutter_native_extensions/src/compression/native_compression)`

Caused by:
  process didn't exit successfully: `/usr/local/google/home/dacoharkes/ffi_samples/hanabi1224/flutter_native_extensions/src/compression/native_compression/target/debug/build/dart_native_compression-e835b0d21fb18870/build-script-build` (exit code: 1)
--- stdout
TARGET = Some("x86_64-unknown-linux-gnu")
HOST = Some("x86_64-unknown-linux-gnu")
CC_x86_64-unknown-linux-gnu = None
CC_x86_64_unknown_linux_gnu = None
HOST_CC = None
CC = None
CFLAGS_x86_64-unknown-linux-gnu = None
CFLAGS_x86_64_unknown_linux_gnu = None
HOST_CFLAGS = None
CFLAGS = None
CRATE_CC_NO_DEFAULTS = None
DEBUG = Some("true")
CARGO_CFG_TARGET_FEATURE = Some("fxsr,sse,sse2")
running: "cc" "-O3" "-ffunction-sections" "-fdata-sections" "-fPIC" "-g" "-fno-omit-frame-pointer" "-m64" "-Wall" "-Wextra" "-o" "/usr/local/google/home/dacoharkes/ffi_samples/hanabi1224/flutter_native_extensions/src/compression/native_compression/target/debug/build/dart_native_compression-088e9ac0ff50ee75/out/liblz4/lib/lz4.o" "-c" "liblz4/lib/lz4.c"
cargo:warning=cc: error: liblz4/lib/lz4.c: No such file or directory
cargo:warning=cc: fatal error: no input files
cargo:warning=compilation terminated.
exit code: 1

--- stderr


error occurred: Command "cc" "-O3" "-ffunction-sections" "-fdata-sections" "-fPIC" "-g" "-fno-omit-frame-pointer" "-m64" "-Wall" "-Wextra" "-o" "/usr/local/google/home/dacoharkes/ffi_samples/hanabi1224/flutter_native_extensions/src/compression/native_compression/target/debug/build/dart_native_compression-088e9ac0ff50ee75/out/liblz4/lib/lz4.o" "-c" "liblz4/lib/lz4.c" with args "cc" did not execute successfully (status code exit code: 1).

@hanabi1224
Copy link
Author

hanabi1224 commented Dec 18, 2019

@dcharkes Ah it's the submodule of zlib source code, sry for not mentioning that, please run
git submodule init && git pull edit:

git submodule update --init --recursive

@dcharkes
Copy link
Contributor

dcharkes commented Dec 18, 2019

flutter_native_extensions/src/compression/native_compression/liblz4 stays an empty folder.

edit: git submodule update was needed.

@dcharkes
Copy link
Contributor

Replace all your BytesBuilder() with BytesBuilder(copy:false).

decompressFrameOneShot(RunTime): 5443367.0 us.
decompressFrameStream(RunTime): 7860634.0 us.
decompressFrameStream2(RunTime): 10691867.0 us.

-->

decompressFrameOneShot(RunTime): 6256975.0 us.
decompressFrameStream(RunTime): 7799498.0 us.
decompressFrameStream2(RunTime): 162550.61538461538 us.

Explanation

$ perf record -g dart --generate-perf-events-symbols main.dart 
Compiling native code...
Compressing data...
Running benchmark...
decompressFrameStream2(RunTime): 11753156.0 us.
[ perf record: Woken up 40 times to write data ]
[kernel.kallsyms] with build id 83248d7cc19f05e9c0f581e73eaf7fbb9dda08cc not found, continuing without symbols
[ perf record: Captured and wrote 10.893 MB perf.data (55851 samples) ]

$ perf report --no-children -i perf.data

Which shows that BytesBuilder is copying data.

Screenshot from 2019-12-18 13-52-12

The thing that is slowest after that is asTypedList(), I've filed dart-lang/sdk#39843 to address this.

@hanabi1224
Copy link
Author

@dcharkes Awesome, thanks!

@hanabi1224
Copy link
Author

@dcharkes Regarding the effort of replacing BytesBuilder with BytesBuilder(copy:false) , it causes test failures (The pointer it copies from being deallocated later), I have updated the code to use it only when possible.

My another question is, how to efficiently make a slice of BytesBuilder? Not sure if I'm doing it properly

final tmpBuffer = sourceBufferBuilder.takeBytes();
final remaining = Uint8List.view(tmpBuffer.buffer, consumedSrcSize,
     tmpBuffer.length - consumedSrcSize);
sourceBufferBuilder.add(remaining);

@dcharkes
Copy link
Contributor

dcharkes commented Feb 9, 2021

Closing old issue, please reopen if issue persists.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Development

No branches or pull requests

3 participants