
[Upload] Handle both Files and InputStreams #2727

Closed
Amraneze opened this issue Sep 23, 2024 · 4 comments
Labels: api: storage · type: question

Comments

Amraneze commented Sep 23, 2024

Upload

Description

We are using Google Cloud Storage to download and decompress files and then upload the decompressed files back to Google Cloud Storage. To avoid overloading the application's heap memory, we work with InputStreams. For that reason, we would like uploads to handle both cases: files and input streams.
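For context, a minimal sketch of the kind of streaming decompress-and-reupload pipeline described above (bucket and object names are placeholders, and it assumes the source objects are stored as plain .gz files with no Content-Encoding transcoding) could look like:

    import com.google.cloud.storage.BlobId;
    import com.google.cloud.storage.BlobInfo;
    import com.google.cloud.storage.Storage;
    import com.google.cloud.storage.StorageOptions;
    import java.io.InputStream;
    import java.nio.channels.Channels;
    import java.util.zip.GZIPInputStream;

    Storage storage = StorageOptions.getDefaultInstance().getService();
    BlobId source = BlobId.of("source-bucket", "data.txt.gz");                   // placeholder
    BlobInfo target = BlobInfo.newBuilder("target-bucket", "data.txt").build();  // placeholder

    // Stream the compressed object, decompress on the fly, and re-upload,
    // without ever holding the full decompressed payload in memory or on disk.
    try (InputStream compressed = Channels.newInputStream(storage.reader(source));
        InputStream decompressed = new GZIPInputStream(compressed)) {
      storage.createFrom(target, decompressed);
    }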

Solution

I drafted PR #2728 as an example of what we need.

Alternatives

Sticking to the normal upload with the Google Cloud Storage client.

The product-auto-label bot added the api: storage label Sep 23, 2024
BenWhitehead (Collaborator) commented:
Hi,

A large reason TransferManager only accepts Paths is that Paths allow minimal memory overhead: the bytes are on disk and can therefore be read and uploaded in small increments (8 KiB at a time). Additionally, if an upload is interrupted with a retryable error, we can retry from any arbitrary offset.

When an InputStream is provided to us, we have to switch to a chunked approach where we buffer up to a certain number of bytes (16 MiB by default) before flushing that buffer to GCS. The reason is that InputStreams are not universally rewindable, so if an interruption happened mid-upload the whole upload would fail; this is especially true of an InputStream from Channels.newInputStream(storage.reader(BlobId.of("bucket-name", "object-name"))).
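For reference, a minimal sketch of that chunked path with the standard Storage client (placeholder names; openSomeStream() is a hypothetical stand-in for whatever produces the InputStream):

    import com.google.cloud.WriteChannel;
    import com.google.cloud.storage.BlobInfo;
    import com.google.cloud.storage.Storage;
    import com.google.cloud.storage.StorageOptions;
    import com.google.common.io.ByteStreams;
    import java.io.InputStream;
    import java.nio.channels.Channels;

    Storage storage = StorageOptions.getDefaultInstance().getService();
    BlobInfo target = BlobInfo.newBuilder("bucket-name", "object-name").build();

    try (InputStream in = openSomeStream();      // hypothetical source of the stream
        WriteChannel w = storage.writer(target)) {
      // The WriteChannel buffers bytes (16 MiB by default) before each flush to GCS,
      // so a retryable failure of the in-flight chunk can be resumed from the buffer.
      // w.setChunkSize(16 * 1024 * 1024);
      ByteStreams.copy(Channels.newChannel(in), w);
    }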

Transferring objects between buckets is something Storage Transfer Service has been purpose-built to perform in a managed, performant manner. A GCS bucket can be both a source and a sink. An example of how you might transition all objects to the Nearline storage class should give you an idea of how to get started: https://cloud.google.com/storage-transfer/docs/create-transfers#client-libraries (then click the Java tab).

BenWhitehead added the type: question label Sep 23, 2024
Amraneze (Author) commented Oct 10, 2024

> Hi,
>
> A large reason TransferManager only accepts Paths is that Paths allow minimal memory overhead: the bytes are on disk and can therefore be read and uploaded in small increments (8 KiB at a time). Additionally, if an upload is interrupted with a retryable error, we can retry from any arbitrary offset.
>
> When an InputStream is provided to us, we have to switch to a chunked approach where we buffer up to a certain number of bytes (16 MiB by default) before flushing that buffer to GCS. The reason is that InputStreams are not universally rewindable, so if an interruption happened mid-upload the whole upload would fail; this is especially true of an InputStream from Channels.newInputStream(storage.reader(BlobId.of("bucket-name", "object-name"))).

That makes sense. Thanks @BenWhitehead for the info. (Sorry for the late reply, I was away.)

> Transferring objects between buckets is something Storage Transfer Service has been purpose-built to perform in a managed, performant manner. A GCS bucket can be both a source and a sink. An example of how you might transition all objects to the Nearline storage class should give you an idea of how to get started: https://cloud.google.com/storage-transfer/docs/create-transfers#client-libraries (then click the Java tab).

We want to decompress some gzip/zip files in Google Cloud Storage; I don't know whether Storage Transfer Service can help us with that.

BenWhitehead (Collaborator) commented:
If STS can't do it, the most reliable way (both the reader and writer have transparent retries under the hood) to make it happen would be to do the following:

    StorageOptions options = StorageOptions.newBuilder().build();
    try (Storage s = options.getService()) {
      BlobId from = BlobId.of("<some-bucket-1>", "<some-object-1>");
      BlobId to = BlobId.of("<some-bucket-2>", from.getName());

      try (ReadChannel r = s.reader(from,
          // pass the option to ensure the contents are gunzipped
          BlobSourceOption.shouldReturnRawInputStream(false)
      );
          WriteChannel w = s.writer(BlobInfo.newBuilder(to).build(), BlobWriteOption.doesNotExist())) {
        // disable buffering in the read channel
        r.setChunkSize(0);
        // set this to something smaller if you want to reduce the amount of buffering
        // w.setChunkSize(16 * 1024 * 1024);
        ByteStreams.copy(r, w);
      }
    }

I've tested the above code and can attest to it working. My source object was a 1.5 MiB gzip'ed text file; when decompressed and copied, it expanded to 512 MiB without gzip.

Since you mentioned that memory footprint is important, I've made a couple of tweaks to chunkSize to change from the defaults.

Since you are going to be writing more bytes than you are reading, the write will end up being the slower of the two. w.setChunkSize can take a value as small as 256 KiB (256 * 1024), but that will be very slow, as it will write 256 KiB to GCS and wait for those bytes to be ack'd before accepting more. The default is 16 MiB and provides decent throughput, but if you're more worried about memory footprint, something like 4 MiB might be a good starting point.
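For example, with the same s, to, and BlobWriteOption as in the snippet above, the 4 MiB variant would just change the writer setup:

    WriteChannel w = s.writer(BlobInfo.newBuilder(to).build(), BlobWriteOption.doesNotExist());
    // 4 MiB buffer: lower memory footprint than the 16 MiB default,
    // at the cost of more frequent flushes to GCS (the minimum is 256 KiB).
    w.setChunkSize(4 * 1024 * 1024);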

Amraneze (Author) commented:
Thanks @BenWhitehead for the proposal. I will try it out and see the results.
