
[Upload] Handle both Files and InputStreams #2727

Closed
Amraneze opened this issue Sep 23, 2024 · 4 comments
Labels: api: storage · type: question

Comments

Amraneze commented Sep 23, 2024

Upload

Description

We are using Google Cloud Storage to download and decompress files and then upload the decompressed files back to Google Cloud Storage. To avoid overloading the application's heap memory, we work with InputStreams. For that reason, we would like uploads to handle both cases: files and input streams.
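For context, a minimal sketch of the kind of streaming decompress-and-reupload pipeline described above (bucket and object names are placeholders, and it assumes the source objects are stored as plain .gz files with no Content-Encoding transcoding) could look like:

    import com.google.cloud.storage.BlobId;
    import com.google.cloud.storage.BlobInfo;
    import com.google.cloud.storage.Storage;
    import com.google.cloud.storage.StorageOptions;
    import java.io.InputStream;
    import java.nio.channels.Channels;
    import java.util.zip.GZIPInputStream;

    Storage storage = StorageOptions.getDefaultInstance().getService();
    BlobId source = BlobId.of("source-bucket", "data.txt.gz");                   // placeholder
    BlobInfo target = BlobInfo.newBuilder("target-bucket", "data.txt").build();  // placeholder

    // Stream the compressed object, decompress on the fly, and re-upload,
    // without ever holding the full decompressed payload in memory or on disk.
    try (InputStream compressed = Channels.newInputStream(storage.reader(source));
        InputStream decompressed = new GZIPInputStream(compressed)) {
      storage.createFrom(target, decompressed);
    }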

Solution

I drafted PR #2728 as an example of what we need.

Alternatives

Sticking to the normal upload with the Google Cloud Storage client.

The product-auto-label bot added the api: storage label Sep 23, 2024
BenWhitehead (Collaborator) commented:
Hi,

A large reason TransferManager only accepts Paths is that Paths allow minimal memory overhead: the bytes are on disk and can therefore be read and uploaded in small increments (8 KiB at a time). Additionally, if an upload is interrupted with a retryable error, we can retry from any arbitrary offset.

When an InputStream is provided to us, we have to switch to a chunked approach where we buffer up to a certain number of bytes (16 MiB by default) before flushing that buffer to GCS. The reason is that InputStreams are not universally rewindable, so if an interruption happened mid-upload the whole upload would fail; this is especially true of an InputStream from Channels.newInputStream(storage.reader(BlobId.of("bucket-name", "object-name"))).
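For reference, a minimal sketch of that chunked path with the standard Storage client (placeholder names; openSomeStream() is a hypothetical stand-in for whatever produces the InputStream):

    import com.google.cloud.WriteChannel;
    import com.google.cloud.storage.BlobInfo;
    import com.google.cloud.storage.Storage;
    import com.google.cloud.storage.StorageOptions;
    import com.google.common.io.ByteStreams;
    import java.io.InputStream;
    import java.nio.channels.Channels;

    Storage storage = StorageOptions.getDefaultInstance().getService();
    BlobInfo target = BlobInfo.newBuilder("bucket-name", "object-name").build();

    try (InputStream in = openSomeStream();      // hypothetical source of the stream
        WriteChannel w = storage.writer(target)) {
      // The WriteChannel buffers bytes (16 MiB by default) before each flush to GCS,
      // so a retryable failure of the in-flight chunk can be resumed from the buffer.
      // w.setChunkSize(16 * 1024 * 1024);
      ByteStreams.copy(Channels.newChannel(in), w);
    }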

Transferring objects between buckets is something Storage Transfer Service has been purpose-built to perform in a managed, performant manner. A GCS bucket can be both a source and a sink. An example of how you might transition all objects to the Nearline storage class should give you an idea of how to get started: https://cloud.google.com/storage-transfer/docs/create-transfers#client-libraries (then click the Java tab).

BenWhitehead added the type: question label Sep 23, 2024
Amraneze (Author) commented Oct 10, 2024

> Hi,
>
> A large reason TransferManager only accepts Paths is that Paths allow minimal memory overhead: the bytes are on disk and can therefore be read and uploaded in small increments (8 KiB at a time). Additionally, if an upload is interrupted with a retryable error, we can retry from any arbitrary offset.
>
> When an InputStream is provided to us, we have to switch to a chunked approach where we buffer up to a certain number of bytes (16 MiB by default) before flushing that buffer to GCS. The reason is that InputStreams are not universally rewindable, so if an interruption happened mid-upload the whole upload would fail; this is especially true of an InputStream from Channels.newInputStream(storage.reader(BlobId.of("bucket-name", "object-name"))).

That makes sense. Thanks @BenWhitehead for the info. (Sorry for the late reply, I was away.)

> Transferring objects between buckets is something Storage Transfer Service has been purpose-built to perform in a managed, performant manner. A GCS bucket can be both a source and a sink. An example of how you might transition all objects to the Nearline storage class should give you an idea of how to get started: https://cloud.google.com/storage-transfer/docs/create-transfers#client-libraries (then click the Java tab).

We want to decompress some gzip/zip files in Google Cloud Storage; I don't know whether Storage Transfer Service can help us with that.

BenWhitehead (Collaborator) commented:
If STS can't do it, the most reliable way (both the reader and writer have transparent retries under the hood) to make it happen would be to do the following:

    StorageOptions options = StorageOptions.newBuilder().build();
    try (Storage s = options.getService()) {
      BlobId from = BlobId.of("<some-bucket-1>", "<some-object-1>");
      BlobId to = BlobId.of("<some-bucket-2>", from.getName());

      try (ReadChannel r = s.reader(from,
          // pass the option to ensure the contents are gunzipped
          BlobSourceOption.shouldReturnRawInputStream(false)
      );
          WriteChannel w = s.writer(BlobInfo.newBuilder(to).build(), BlobWriteOption.doesNotExist())) {
        // disable buffering in the read channel
        r.setChunkSize(0);
        // set this to something smaller if you want to reduce the amount of buffering
        // w.setChunkSize(16 * 1024 * 1024);
        ByteStreams.copy(r, w);
      }
    }

I've tested the above code and can attest to it working. My source object was a 1.5 MiB gzip'ed text file; when decompressed and copied, it expanded to 512 MiB without gzip.

Since you mentioned that memory footprint is important, I've made a couple of tweaks to chunkSize to change from the defaults.

Since you are going to be writing more bytes than you are reading, the write will end up being the slower of the two. w.setChunkSize can take a value as small as 256 KiB (256 * 1024), but that will be very slow, as it will write 256 KiB to GCS and wait for those bytes to be ack'd before accepting more. The default is 16 MiB and provides decent throughput, but if you're more worried about memory footprint, something like 4 MiB might be a good starting point.
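For example, with the same s, to, and BlobWriteOption as in the snippet above, the 4 MiB variant would just change the writer setup:

    WriteChannel w = s.writer(BlobInfo.newBuilder(to).build(), BlobWriteOption.doesNotExist());
    // 4 MiB buffer: lower memory footprint than the 16 MiB default,
    // at the cost of more frequent flushes to GCS (the minimum is 256 KiB).
    w.setChunkSize(4 * 1024 * 1024);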

Amraneze (Author) commented:
Thanks @BenWhitehead for the proposal. I will try it out and see the results.
