Uploading small files of unknown size uses too much memory #848

Closed
rwestlund opened this issue Oct 18, 2017 · 14 comments

@rwestlund

Version

minio-go 3.0.3
FreeBSD 11.1/amd64
Go 1.9

Background

My app accepts multipart file uploads over HTTP (thus the file size can not be known) and uses this library to stream them directly into an S3-compatible store. I run it on a small VPS with 512 MiB RAM and a tiny hard drive. The purpose of streaming them directly into S3 is to prevent buffering large files in RAM or on disk.

Expected Behavior

I should be able to stream small files of unknown size into S3 without using lots of memory.

Actual Behavior

When minio-go starts uploading a file of unknown size, optimalPartInfo(-1) assumes the worst case of needing to fit a 5 TiB file within 10000 chunks and allocates a 576 MiB buffer.
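
For reference, here is roughly where the 576 MiB figure comes from; a minimal sketch assuming the constants in that era's constants.go (5 TiB maximum multipart object size, 10000 parts, 64 MiB minimum part size), not the library's exact code:

package main

import (
        "fmt"
        "math"
)

func main() {
        const (
                assumedObjectSize = int64(5) * 1024 * 1024 * 1024 * 1024 // worst case: 5 TiB
                maxPartsCount     = 10000
                minPartSize       = int64(64) * 1024 * 1024 // 64 MiB
        )

        // Spread the assumed maximum object size over the maximum number of
        // parts, then round up to a multiple of the minimum part size.
        partSize := math.Ceil(float64(assumedObjectSize) / float64(maxPartsCount))
        partSize = math.Ceil(partSize/float64(minPartSize)) * float64(minPartSize)

        fmt.Printf("per-part buffer: %.0f MiB\n", partSize/(1024*1024)) // prints 576 MiB
}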

The Problem

Allocating this massive buffer causes my server to grind to a halt for 60 seconds while the page daemon swaps to make room for this large allocation that can't fit in RAM. This is silly when I know I'll never handle a file larger than 10 GiB.

Temporary Workaround

Fork and modify constants.go.

Solutions

  • The constants could be configurable, and belong to the Client struct.
  • The new PutObjectOptions struct you're adding could have a field for the maximum buffer size or maximum expected file size.
@harshavardhana
Member

Please use master to override the part size by providing the right size or use the min part size override.

Or for now if you wish to use 3.0.3 - fork the code and change it locally.

harshavardhana added a commit to harshavardhana/minio-go that referenced this issue Oct 18, 2017
This is done to ensure that the caller can control the amount of
memory being used when uploading a stream without knowing its
length in advance.

Fixes minio#848
@kannappanr added this to the Current milestone Oct 18, 2017
@rwestlund
Author

As stated above, the file size is unknown.

I can't find anything in the documentation about a min part size override. What do you mean?

@harshavardhana
Member

harshavardhana commented Oct 19, 2017

File size cannot be unknown in S3. We just allow it so that we can buffer and upload as a convenience. You shouldn't build an application around this behavior when you don't have enough RAM.

In any case, there is a PR submitted to make this configurable, so that the overall buffer limit can be set manually.

NOTE: this will also change the maximum total size of the object you can upload.

@rwestlund
Author

Thanks for the PR, that looks perfect.

@harshavardhana
Member

We are not going to fix this in the way proposed by PR #849; I closed it instead. Here is a way to achieve what you want more idiomatically.

package main

import (
        "bytes"
        "fmt"
        "io"
        "log"
        "os"

        "github.com/cheggaaa/pb"
        minio "github.com/minio/minio-go"
)

func main() {
        s3Client, err := minio.New("play.minio.io:9000", "Q3AM3UQ867SPQQA43P2F", "zuf+tfteSlswRu7BJ86wekitnifILbZam1KYY3TG", true)
        if err != nil {
                log.Fatalln(err)
        }

        var (
                totalParts = 10000
                partSize   = int64(10 * 1024 * 1024)
                i          = 0
        )
        var srcs = make([]minio.SourceInfo, totalParts)

        progress := pb.New64(-1)
        progress.Start()

        // Optional: trace S3 API calls for debugging.
        s3Client.TraceOn(os.Stdout)

        var buf = make([]byte, partSize)
        for i < totalParts {
                length, rErr := io.ReadFull(os.Stdin, buf)
                // io.EOF with zero bytes read means the stream ended exactly on
                // a part boundary; stop once at least one part has been uploaded.
                if rErr == io.EOF && i > 0 {
                        break
                }
                if rErr != nil && rErr != io.ErrUnexpectedEOF {
                        log.Fatalln(rErr)
                }

                // Upload this chunk as its own temporary object, parts/<i>.
                if _, err = s3Client.PutObject("testbucket", fmt.Sprintf("parts/%d", i), bytes.NewReader(buf[:length]), int64(length),
                        minio.PutObjectOptions{
                                Progress: progress,
                        }); err != nil {
                        log.Fatalln(err)
                }

                // Source objects to concatenate.
                srcs[i] = minio.NewSourceInfo("testbucket", fmt.Sprintf("parts/%d", i), nil)
                i++
        }

        // Create destination info
        dst, err := minio.NewDestinationInfo("testbucket", "final-object", nil, nil)
        if err != nil {
                log.Fatalln(err)
        }

        if err = s3Client.ComposeObject(dst, srcs[:i]); err != nil {
                log.Fatalln(err)
        }

        objectsCh := make(chan string)

        // Send the names of the temporary part objects to objectsCh for removal.
        go func() {
                defer close(objectsCh)
                for j := 0; j < i; j++ { // parts 0..i-1 were uploaded
                        objectsCh <- fmt.Sprintf("parts/%d", j)
                }
        }()

        // Call RemoveObjects API on the same bucket the parts were uploaded to.
        for e := range s3Client.RemoveObjects("testbucket", objectsCh) {
                log.Fatalln("Failed to remove " + e.ObjectName + ", error: " + e.Err.Error())
        }

}

@rwestlund
Author

Your suggestion is for me to reimplement your multipart upload logic? Copying code out of a library just to change a buffer size is messy. The ability to bound minio-go's memory usage is a critical feature, IMHO.

I suggest making the constants configurable via the Client object. These days there are many different S3-compatible services (I'm using Digital Ocean's Spaces), and they likely have different size limits anyway.

@harshavardhana
Member

No, not at all. I am not talking about that at all. If you look, I am not using low-level APIs.

What I am suggesting is that you redesign your application.

@harshavardhana
Member

Also notice that ComposeObject here does a server-side copy, so it is in fact quite beneficial and more efficient for your requirement.

A size of -1 for PutObject, coupled with not even having 576 MiB available, is an unusual requirement. Changing the client API for this would create more confusion for our users over an uncommon use case, so we are not going to do that.

So the above approach addresses your problem by letting you manage the buffer yourself, since you know the right buffer size for your environment, and it also gives you the flexibility of setting an upper bound on your total data transfer.

@rwestlund
Author

Fair enough, and thanks for the example code. But I ran into a problem with it:

Error in upload-part-copy - A header you provided implies functionality that is not implemented

It looks like Digital Ocean's S3 implementation doesn't support object composition from multiple objects as you have it implemented in ComposeObject(). It's just as well, as your example would corrupt data if there are multiple parallel uploads anyway.

I need to implement this as an actual multipart upload (where you get an upload ID from S3). This means I'd make a custom version of putObjectMultipartStreamNoLength(). However, the functions I need are not exported: newUploadID(), uploadPart(), abortMultipartUpload(), and completeMultipartUpload().

Do you have a suggestion for how to proceed?

@harshavardhana
Member

It looks like Digital Ocean's S3 implementation doesn't support object composition from multiple objects as you have it implemented in ComposeObject(). It's just as well, as your example would corrupt data if there are multiple parallel uploads anyway.

CopyObjectPart is a properly locked operation, so I don't see any storage implementation, Minio or AWS S3, making the replacement of a source object racy while it is being read and copied over on the other end @rwestlund. So this API is safe to use even in concurrent situations.

Since Digital Ocean doesn't implement it yet, you have two options:

  • Fork minio-go and use it locally with the upper bound based on 1 TiB instead of 5 TiB. You can change this in the optimalPartInfo() function in api-put-object-common.go.
  • Wait for Digital Ocean to implement all S3 API calls and use ComposeObject.

@rwestlund
Author

The problem I see in your example is with uploading each chunk as a normal object called "parts/1" and "parts/2", etc. If I started a second upload, it would overwrite "parts/1" while the first upload was still uploading "parts/2". When the first upload finally starts the server-side copy, it could get parts of the first file and parts of the second file mixed together.

I could generate a UUID for each upload to avoid object name collisions, but this is still a hacky way to upload a large file. S3's multipart API is there for a reason.

I'll check whether the AWS SDK or any of the smaller S3 libraries allow multipart uploads with bounded memory, then fork minio-go as a last resort.

@harshavardhana
Member

That is just an example; you can indeed use a UUID. It is not a hacky approach; it is pretty neat functionality used by some backup software solutions.
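
For illustration, a per-upload prefix could look something like this (a minimal sketch; github.com/google/uuid is just one choice of unique-ID package, not something required by minio-go):

package main

import (
        "fmt"

        "github.com/google/uuid"
)

func main() {
        // A fresh prefix per upload keeps concurrent uploads from overwriting
        // each other's temporary "parts/N" objects.
        prefix := "parts/" + uuid.New().String()
        for i := 0; i < 3; i++ {
                fmt.Printf("%s/%d\n", prefix, i) // e.g. parts/<uuid>/0
        }
}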

You can also look at the minio-go Core API to get direct access to the multipart API if needed.

Feel free to close this issue; please reopen if you have further questions.

@rwestlund
Author

Ah, the Core API is what I needed. Now I have multipart uploads with bounded memory :)

Thanks for your help, I really appreciate it.

@WGH-

WGH- commented Aug 10, 2023

I think we don't have to allocate partSize bytes (which is an enormous half a gigabyte by default) right away.

We can start with some smaller amount, like 512 bytes, and increase it as we go, just like io.ReadAll does.
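
A minimal sketch of that idea, under the assumption that the goal is simply to fill one part-sized buffer from an io.Reader without allocating the full part size up front (this is not minio-go's actual code, just the growth pattern io.ReadAll uses):

package main

import (
        "io"
        "log"
        "os"
)

// readPart reads up to partSize bytes from r. Instead of allocating partSize
// bytes immediately, it starts small and doubles the buffer as data arrives,
// the way io.ReadAll grows its slice.
func readPart(r io.Reader, partSize int64) ([]byte, error) {
        initial := int64(512)
        if partSize < initial {
                initial = partSize
        }
        buf := make([]byte, 0, initial)
        for int64(len(buf)) < partSize {
                if len(buf) == cap(buf) {
                        // Double the capacity, capped at the part size.
                        newCap := int64(cap(buf)) * 2
                        if newCap > partSize {
                                newCap = partSize
                        }
                        grown := make([]byte, len(buf), newCap)
                        copy(grown, buf)
                        buf = grown
                }
                n, err := r.Read(buf[len(buf):cap(buf)])
                buf = buf[:len(buf)+n]
                if err == io.EOF {
                        return buf, nil // stream ended before the part filled up
                }
                if err != nil {
                        return buf, err
                }
        }
        return buf, nil
}

func main() {
        // Example: read a single 10 MiB part from stdin.
        part, err := readPart(os.Stdin, 10*1024*1024)
        if err != nil {
                log.Fatalln(err)
        }
        log.Printf("read %d bytes into a buffer with capacity %d", len(part), cap(part))
}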
