
Upload manager: When using an io.Reader without seeking support, the uploader over-allocates memory, leading to high usage #2694

erezrokah opened this issue Jun 24, 2024 · 1 comment
Labels
feature/s3/manager: Pertains to S3 transfer manager HLL (feature/s3/manager).
feature-request: A feature should be added or improved.
p2: This is a standard priority issue.


erezrokah commented Jun 24, 2024


Describe the bug

This is essentially the same as #1302, but I think there's a possible optimization that would improve memory usage.

When uploading small files that don't require multipart uploads with a reader that doesn't support seeking, the uploader always allocates at least a 5MB buffer (named partSize in the code) to read data from the stream.
For example, if you upload a 10KB file, 5MB is allocated.
If you upload 200 10KB files in parallel, ~1GB of memory is allocated instead of the ~2MB actually needed.

The 5MB limit comes from the minimum part size for multipart uploads, but I don't see a reason to couple it with the buffer used for reading from the stream.
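For illustration, a minimal sketch that triggers this path (bucket, key, and file name are placeholders; error handling is trimmed). Wrapping the body in io.NopCloser hides the file's Seek method, so the uploader can't determine the size up front and falls back to its pooled, partSize-sized read path:

package main

import (
	"context"
	"io"
	"log"
	"os"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/feature/s3/manager"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
	cfg, err := config.LoadDefaultConfig(context.TODO())
	if err != nil {
		log.Fatal(err)
	}

	f, err := os.Open("small-10kb-file") // placeholder: any ~10KB file
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	uploader := manager.NewUploader(s3.NewFromConfig(cfg))
	// io.NopCloser strips the Seek method, so the uploader buffers the body
	// through its part pool and allocates a full 5MB part for this 10KB file.
	_, err = uploader.Upload(context.TODO(), &s3.PutObjectInput{
		Bucket: aws.String("my-bucket"), // placeholder
		Key:    aws.String("my-key"),    // placeholder
		Body:   io.NopCloser(f),
	})
	if err != nil {
		log.Fatal(err)
	}
}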

Expected Behavior

Only allocate the memory needed to upload the file when it's under 5MB.

Current Behavior

Memory consumption is unnecessarily high.

Reproduction Steps

#1302 covers these in detail.

Possible Solution

Solution 1. Use a LimitReader to read up to partSize for the first read instead of using the pool. If the payload is below 5MB, don't use the pool at all; if it's above, revert to the old behavior (see proposal here).
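A minimal sketch of that idea, using only the standard library io package (readFirstPart is a hypothetical helper, not SDK code):

// readFirstPart peeks at the stream to decide between a single-part upload
// and the multipart path, without touching the part pool.
func readFirstPart(r io.Reader, partSize int64) (firstPart []byte, multipart bool, err error) {
	// Read at most partSize+1 bytes; the extra byte reveals whether more data follows.
	// io.ReadAll grows its buffer incrementally, so a small stream allocates
	// memory on the order of its own size, not a full 5MB part.
	buf, err := io.ReadAll(io.LimitReader(r, partSize+1))
	if err != nil {
		return nil, false, err
	}
	if int64(len(buf)) <= partSize {
		// The whole payload fits in one part: send it as a plain PutObject,
		// skipping the pool entirely.
		return buf, false, nil
	}
	// More data remains: revert to the pooled multipart path, treating buf
	// as the beginning of part 1.
	return buf, true, nil
}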

Solution 2. Drop the pool altogether and let the GC manage memory. Always use a LimitReader to read up to partSize and let the GC deal with allocations. The problem with the pool is that it isn't cleared until the upload finishes: with concurrency 5 and a part size of 100MB, that memory is held until the end of the upload (assuming the file is larger than 500MB). If one of the last parts is slow for some reason, the memory for the already-uploaded parts is still held. The GC should (I hope) be smart enough to reuse buffers and reclaim them when another part of the process needs the memory more.
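A sketch of what the per-part read in Solution 2 could look like (again a hypothetical helper, standard library only):

// readPart fills a freshly allocated buffer for one part. io.ReadFull returns
// io.ErrUnexpectedEOF on a short final part and io.EOF if the stream is
// already exhausted.
func readPart(r io.Reader, partSize int64) (part []byte, last bool, err error) {
	buf := make([]byte, partSize) // fresh allocation per part; no long-lived pool
	n, err := io.ReadFull(r, buf)
	switch err {
	case nil:
		return buf, false, nil
	case io.EOF, io.ErrUnexpectedEOF:
		// Short or empty final part; once uploaded, nothing keeps buf alive
		// and the GC can reclaim it.
		return buf[:n], true, nil
	default:
		return nil, false, err
	}
}

Each part's buffer becomes garbage as soon as its request completes, so peak memory tracks the number of in-flight parts rather than every part read since the upload began.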

Additional Information/Context

Our current workaround is:

// r is a reader that doesn't support seeking; buffer it fully in memory.
allData, err := io.ReadAll(r)
if err != nil {
	return err
}
// bytes.Reader supports seeking; pass this to the uploader as the Body.
readerSeeker := bytes.NewReader(allData)

But that has the downside of allocating a large byte array for large files.
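When the payload size happens to be known up front (say, from a Content-Length header), a variant of this workaround avoids io.ReadAll's growth reallocations by sizing the buffer exactly; bodyForUpload is an illustrative name, assuming imports of bytes and io:

// bodyForUpload buffers a non-seeking reader of known size into a seekable body.
func bodyForUpload(r io.Reader, size int64) (io.ReadSeeker, error) {
	buf := make([]byte, size) // one exact-size allocation instead of ReadAll's doubling
	if _, err := io.ReadFull(r, buf); err != nil {
		return nil, err
	}
	// bytes.Reader supports seeking, so the uploader can determine the size
	// up front and avoid its pooled read path.
	return bytes.NewReader(buf), nil
}

This still holds the whole payload in memory, so it shares the large-file downside above; it only trims the reallocation overhead.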

AWS Go SDK V2 Module Versions Used

github.com/aws/aws-sdk-go-v2/feature/s3/manager v1.16.24

Compiler and Version used

go version go1.22.2 darwin/arm64

Operating System and version

macOS 14.5 (23F79)

@erezrokah added the bug and needs-triage labels on Jun 24, 2024.
@lucix-aws added the feature/s3/manager, p2, and feature-request labels and removed the needs-triage and bug labels on Jun 24, 2024.
lucix-aws (Contributor) commented

Filing as a feature request; this will likely be OBE (overtaken by events) as we move towards a redo of the transfer manager, #2649.

kodiakhq bot pushed a commit to cloudquery/cloudquery that referenced this issue Jul 18, 2024
…#18608)

#### Summary

Follow up to #18361.
When `no_rotate: true` we can get large files to upload, and when that happens we hold the whole file in memory and eventually run out of it.
An example is a sync on many AWS accounts (for example, 2000), even for a single table. If the table data is 10MB per account, we end up with a 20GB file.

The downside of this change is that if a sync has a single account but many tables (and `no_rotate: true`), we'll run into the same issue that #18361 fixed.

There are workarounds for both issues (either reducing the number of accounts in the sync, or reducing the concurrency).

The best solution would be to get aws/aws-sdk-go-v2#2694 fixed (suggested solution in https://github.com/aws/aws-sdk-go-v2/compare/main...erezrokah:fix/dont_over_allocate_small_files?expand=1), but it doesn't seem like that's going to happen soon.