Support for io.Reader Interface in S3 Transfer Manager's Downloader #2247
Comments
Hi @yacchi, this seems like a reasonable request. We will likely work on this when we re-implement the Downloader. For now we cannot prioritize it, but I will add it to our backlog. Thanks!
To @yacchi or anyone else who may be watching this issue -- I haven't tested this, but I believe you can achieve "sequential" I/O by setting download concurrency to 1, which is spec'd to guarantee sequential in-order multipart downloads. If that's the case, your WriterAt implementation can simply adapt each WriteAt() call into a sequential write:

// sequentialWriterAt adapts WriteAt() calls to a sequential I/O implementation
type sequentialWriterAt struct {
    w   io.Writer // or copy to another reader, etc.
    off int64
}

func (v *sequentialWriterAt) WriteAt(p []byte, off int64) (int, error) {
    if off != v.off {
        return 0, fmt.Errorf("broken write sequence")
    }
    n, err := v.w.Write(p)
    if err != nil {
        return n, fmt.Errorf("write: %v", err)
    }
    v.off += int64(n)
    return n, nil
}

This is definitely something we'd like to support for concurrent downloads, though.
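For illustration, here is a minimal sketch (not from the original comment) of how the adapter above could be wired up with an io.Pipe so the download is exposed as an io.Reader; the bucket and key names are placeholders.

package main

import (
    "context"
    "io"
    "log"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/feature/s3/manager"
    "github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
    cfg, err := config.LoadDefaultConfig(context.TODO())
    if err != nil {
        log.Fatal(err)
    }

    // Concurrency of 1 is what makes the in-order WriteAt() sequence possible.
    downloader := manager.NewDownloader(s3.NewFromConfig(cfg), func(d *manager.Downloader) {
        d.Concurrency = 1
    })

    pr, pw := io.Pipe()
    go func() {
        _, err := downloader.Download(context.TODO(), &sequentialWriterAt{w: pw},
            &s3.GetObjectInput{
                Bucket: aws.String("my-bucket"), // placeholder
                Key:    aws.String("my-key"),    // placeholder
            })
        pw.CloseWithError(err) // propagate success or failure to the reader side
    }()

    // pr now behaves as an io.Reader over the object's bytes, in order.
    if _, err := io.Copy(io.Discard, pr); err != nil {
        log.Fatal(err)
    }
}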
Thanks for the helpful code. It looks like I can certainly do the sequentialization that way. I implemented it, and with that code it is possible to download a tar file at an average speed of 800 MiB/s or more in an environment with sufficient memory.
@lucix-aws @RanVaknin I opened a PR for a concurrent io.Reader/io.WriteCloser for this issue. See #2622
Would definitely appreciate having a way to handle this as part of the SDK, as this brings quite a lot of complexity to usage. Also, the current proposed approach (setting the download concurrency to 1) gives up the concurrency that makes the Transfer Manager fast in the first place.

If this can be of help to anyone, I have implemented an io.WriterAt adapter that forwards chunks, in order, to an underlying io.WriteCloser. This might use additional memory, since chunks that arrive out of order are buffered in an internal map until the expected offset is reached.

import (
    "io"
    "sync"
)

// WriteAtWriter is an implementation that handles write concurrency from the AWS S3 download manager.
// It works by storing pending bytes in its internal buffer and flushing them when the expected offset is reached.
type WriteAtWriter struct {
    writer  io.WriteCloser
    offset  int64
    mu      sync.Mutex
    pending map[int64][]byte
}

func NewWriteAtWriter(writer io.WriteCloser) *WriteAtWriter {
    return &WriteAtWriter{writer: writer, offset: 0, pending: make(map[int64][]byte)}
}

func (fw *WriteAtWriter) WriteAt(p []byte, offset int64) (n int, err error) {
    fw.mu.Lock()
    defer fw.mu.Unlock()

    // If the chunk is the currently expected chunk, write it and flush pending chunks.
    if offset == fw.offset {
        written, err := fw.SendAll(p)
        fw.offset += int64(written)
        if err != nil {
            return written, err
        }
        err = fw.flush()
        if err != nil {
            return written, err
        }
        return written, nil
    }

    // Otherwise, queue the chunk.
    // A copy is required, because the AWS SDK re-uses the `p` slice under the hood.
    copied := make([]byte, len(p))
    copy(copied, p)
    fw.pending[offset] = copied
    return len(p), nil
}

func (fw *WriteAtWriter) flush() error {
    for {
        p, ok := fw.pending[fw.offset]
        if !ok {
            break
        }
        written, err := fw.SendAll(p)
        if err != nil {
            return err
        }
        delete(fw.pending, fw.offset)
        fw.offset += int64(written)
    }
    return nil
}

func (fw *WriteAtWriter) SendAll(bytes []byte) (int, error) {
    written := 0
    for written < len(bytes) {
        n, err := fw.writer.Write(bytes[written:])
        if err != nil {
            return written, err
        }
        written += n
    }
    return written, nil
}
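As an illustration (not part of the original comment), here is a minimal sketch of how this wrapper could be combined with an io.Pipe so the consumer sees a plain io.Reader while the download stays concurrent; the bucket and key names are placeholders.

package main

import (
    "context"
    "io"
    "log"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/feature/s3/manager"
    "github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
    cfg, err := config.LoadDefaultConfig(context.TODO())
    if err != nil {
        log.Fatal(err)
    }
    // Default (concurrent) downloader; WriteAtWriter restores the byte order.
    downloader := manager.NewDownloader(s3.NewFromConfig(cfg))

    pr, pw := io.Pipe()
    go func() {
        _, err := downloader.Download(context.TODO(), NewWriteAtWriter(pw), &s3.GetObjectInput{
            Bucket: aws.String("my-bucket"), // placeholder
            Key:    aws.String("my-key"),    // placeholder
        })
        pw.CloseWithError(err) // unblock the reader when the download finishes or fails
    }()

    // pr yields the object bytes strictly in order and can be passed to tar, gzip, etc.
    if _, err := io.Copy(io.Discard, pr); err != nil {
        log.Fatal(err)
    }
}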
@pelletier197 you can use my s3io package if you want to eliminate that complexity. |
Describe the feature
Currently, the Download function implemented in the Transfer Manager's Downloader accepts an io.WriterAt. Because of this, after writing to a file or buffer, a separate step is needed to create an io.Reader.
When working with files in Go, the io.Reader interface is commonly required. I believe that if the Downloader could directly produce an io.Reader, it would significantly improve usability.
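For reference, that current pattern looks roughly like the following sketch (not part of the original issue): the whole object is first written into a manager.WriteAtBuffer, and only then can an io.Reader be created. Bucket and key names are placeholders.

package main

import (
    "bytes"
    "context"
    "log"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/feature/s3/manager"
    "github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
    cfg, err := config.LoadDefaultConfig(context.TODO())
    if err != nil {
        log.Fatal(err)
    }
    downloader := manager.NewDownloader(s3.NewFromConfig(cfg))

    // The whole object is buffered in memory before any byte can be read.
    buf := manager.NewWriteAtBuffer([]byte{})
    if _, err := downloader.Download(context.TODO(), buf, &s3.GetObjectInput{
        Bucket: aws.String("my-bucket"), // placeholder
        Key:    aws.String("my-key"),    // placeholder
    }); err != nil {
        log.Fatal(err)
    }

    // Only now can downstream code get the io.Reader it needs.
    reader := bytes.NewReader(buf.Bytes())
    _ = reader
}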
Use Case
Decompressing archive files (e.g., tar, zip) stored in S3
These operations are typically built around the io.Reader interface. With this feature, it would be possible to directly load and decompress files from S3 (see the sketch after this list).
Stream processing large files in environments with limited memory or storage
By using an io.Reader to process streams directly, it becomes possible to handle large files without using storage and with minimal memory consumption.
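As an illustration of the first use case, here is a minimal sketch (assumed code, not from the issue) that consumes the io.Reader the Downloader would produce to stream-extract a .tar.gz object without touching local storage.

import (
    "archive/tar"
    "compress/gzip"
    "io"
    "log"
)

// extractTarGz streams a .tar.gz archive from r; r stands in for the io.Reader
// that a future Downloader API (or one of the adapters above) would provide.
func extractTarGz(r io.Reader) error {
    gz, err := gzip.NewReader(r)
    if err != nil {
        return err
    }
    defer gz.Close()

    tr := tar.NewReader(gz)
    for {
        hdr, err := tr.Next()
        if err == io.EOF {
            return nil // end of archive
        }
        if err != nil {
            return err
        }
        log.Printf("entry %s (%d bytes)", hdr.Name, hdr.Size)
        // io.Discard keeps the sketch minimal; a real caller would write the entry somewhere.
        if _, err := io.Copy(io.Discard, tr); err != nil {
            return err
        }
    }
}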
Proposed Solution
The behavior of the AWS CLI's cp command closely aligns with my expectations. For instance, it can be used as follows:
aws s3 cp s3://BUCKET/key.tar.gz - | tar zxf -
Internally, it appears to use a heap to output chunks sequentially from the beginning.
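To make that chunk-ordering idea concrete, here is a small sketch (assumed types, not the CLI's actual code) that uses container/heap to release buffered parts in offset order:

import (
    "container/heap"
    "io"
)

// chunk is a downloaded part identified by its starting offset in the object.
type chunk struct {
    off  int64
    data []byte
}

// chunkHeap is a min-heap of chunks ordered by offset.
type chunkHeap []chunk

func (h chunkHeap) Len() int           { return len(h) }
func (h chunkHeap) Less(i, j int) bool { return h[i].off < h[j].off }
func (h chunkHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *chunkHeap) Push(x any)        { *h = append(*h, x.(chunk)) }
func (h *chunkHeap) Pop() any {
    old := *h
    c := old[len(old)-1]
    *h = old[:len(old)-1]
    return c
}

// sequencer writes chunks to w in offset order, parking out-of-order arrivals on the heap.
type sequencer struct {
    w    io.Writer
    next int64
    h    chunkHeap
}

func (s *sequencer) add(off int64, data []byte) error {
    heap.Push(&s.h, chunk{off: off, data: data})
    // Drain every buffered chunk whose turn has come.
    for s.h.Len() > 0 && s.h[0].off == s.next {
        c := heap.Pop(&s.h).(chunk)
        if _, err := s.w.Write(c.data); err != nil {
            return err
        }
        s.next += int64(len(c.data))
    }
    return nil
}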
I've created code in my repository that operates in a similar manner using the current Downloader.
https://github.com/yacchi/s3-fast-reader
Other Information
No response
Acknowledgements
AWS Go SDK V2 Module Versions Used
github.com/aws/aws-sdk-go-v2 v1.20.1
github.com/aws/aws-sdk-go-v2/config v1.18.33
github.com/aws/aws-sdk-go-v2/feature/s3/manager v1.11.77
github.com/aws/aws-sdk-go-v2/service/s3 v1.38.2
Go version used
1.20.4