Uploading small files of unknown size uses too much memory #848

Closed
rwestlund opened this issue Oct 18, 2017 · 14 comments

@rwestlund

Version

minio-go 3.0.3
FreeBSD 11.1/amd64
Go 1.9

Background

My app accepts multipart file uploads over HTTP (thus the file size can not be known) and uses this library to stream them directly into an S3-compatible store. I run it on a small VPS with 512 MiB RAM and a tiny hard drive. The purpose of streaming them directly into S3 is to prevent buffering large files in RAM or on disk.

Expected Behavior

I should be able to stream small files of unknown size into S3 without using lots of memory.

Actual Behavior

When minio-go starts uploading a file of unknown size, optimalPartInfo(-1) assumes the worst case of needing to fit a 5 TiB file within 10000 chunks and allocates a 576 MiB buffer.
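
For reference, here is roughly where the 576 MiB figure comes from; a minimal sketch assuming the constants in that era's constants.go (5 TiB maximum multipart object size, 10000 parts, 64 MiB minimum part size), not the library's exact code:

package main

import (
        "fmt"
        "math"
)

func main() {
        const (
                assumedObjectSize = int64(5) * 1024 * 1024 * 1024 * 1024 // worst case: 5 TiB
                maxPartsCount     = 10000
                minPartSize       = int64(64) * 1024 * 1024 // 64 MiB
        )

        // Spread the assumed maximum object size over the maximum number of
        // parts, then round up to a multiple of the minimum part size.
        partSize := math.Ceil(float64(assumedObjectSize) / float64(maxPartsCount))
        partSize = math.Ceil(partSize/float64(minPartSize)) * float64(minPartSize)

        fmt.Printf("per-part buffer: %.0f MiB\n", partSize/(1024*1024)) // prints 576 MiB
}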

The Problem

Allocating this massive buffer causes my server to grind to a halt for 60 seconds while the page daemon swaps to make room for this large allocation that can't fit in RAM. This is silly when I know I'll never handle a file larger than 10 GiB.

Temporary Workaround

Fork and modify constants.go.

Solutions

  • The constants could be configurable, and belong to the Client struct.
  • The new PutObjectOptions struct you're adding could have a field for the maximum buffer size or maximum expected file size.
@harshavardhana
Member

Please use master to override the part size by providing the right size or use the min part size override.

Or for now if you wish to use 3.0.3 - fork the code and change it locally.

harshavardhana added a commit to harshavardhana/minio-go that referenced this issue Oct 18, 2017
This is done to ensure that the caller can control the amount of
memory being used when uploading a stream without knowing its
length in advance.

Fixes minio#848
@kannappanr added this to the Current milestone Oct 18, 2017
@rwestlund
Author

As stated above, the file size is unknown.

I can't find anything in the documentation about a min part size override. What do you mean?

@harshavardhana
Member

harshavardhana commented Oct 19, 2017

File size cannot be unknown in S3. We just allow it so that we can buffer and upload as a convenience. You shouldn't build an application around this behavior when you don't have enough RAM.

In any case, there is a PR submitted to make this configurable, so that the overall buffer limit can be set manually.

NOTE: this will also change the maximum total size of the object you can upload.

@rwestlund
Author

Thanks for the PR, that looks perfect.

@harshavardhana
Member

We are not going to fix this in the way proposed by PR #849; I closed it instead. Here is a way to achieve what you want more idiomatically.

package main

import (
        "bytes"
        "fmt"
        "io"
        "log"
        "os"

        "github.com/cheggaaa/pb"
        minio "github.com/minio/minio-go"
)

func main() {
        s3Client, err := minio.New("play.minio.io:9000", "Q3AM3UQ867SPQQA43P2F", "zuf+tfteSlswRu7BJ86wekitnifILbZam1KYY3TG", true)
        if err != nil {
                log.Fatalln(err)
        }

        var (
                totalParts = 10000
                partSize   = int64(10 * 1024 * 1024)
                i          = 0
        )
        var srcs = make([]minio.SourceInfo, totalParts)

        progress := pb.New64(-1)
        progress.Start()

        // Optional: trace S3 API calls for debugging.
        s3Client.TraceOn(os.Stdout)

        var buf = make([]byte, partSize)
        for i < totalParts {
                length, rErr := io.ReadFull(os.Stdin, buf)
                // io.EOF with zero bytes read means the stream ended exactly on
                // a part boundary; stop once at least one part has been uploaded.
                if rErr == io.EOF && i > 0 {
                        break
                }
                if rErr != nil && rErr != io.ErrUnexpectedEOF {
                        log.Fatalln(rErr)
                }

                // Upload this chunk as its own temporary object, parts/<i>.
                if _, err = s3Client.PutObject("testbucket", fmt.Sprintf("parts/%d", i), bytes.NewReader(buf[:length]), int64(length),
                        minio.PutObjectOptions{
                                Progress: progress,
                        }); err != nil {
                        log.Fatalln(err)
                }

                // Source objects to concatenate.
                srcs[i] = minio.NewSourceInfo("testbucket", fmt.Sprintf("parts/%d", i), nil)
                i++
        }

        // Create destination info
        dst, err := minio.NewDestinationInfo("testbucket", "final-object", nil, nil)
        if err != nil {
                log.Fatalln(err)
        }

        if err = s3Client.ComposeObject(dst, srcs[:i]); err != nil {
                log.Fatalln(err)
        }

        objectsCh := make(chan string)

        // Send the names of the temporary part objects to objectsCh for removal.
        go func() {
                defer close(objectsCh)
                for j := 0; j < i; j++ { // parts 0..i-1 were uploaded
                        objectsCh <- fmt.Sprintf("parts/%d", j)
                }
        }()

        // Call RemoveObjects API on the same bucket the parts were uploaded to.
        for e := range s3Client.RemoveObjects("testbucket", objectsCh) {
                log.Fatalln("Failed to remove " + e.ObjectName + ", error: " + e.Err.Error())
        }

}

@rwestlund
Author

Your suggestion is for me to reimplement your multipart upload logic? Copying code out of a library just to change a buffer size is messy. The ability to bound minio-go's memory usage is a critical feature, IMHO.

I suggest making the constants configurable via the Client object. These days there are many different S3-compatible services (I'm using Digital Ocean's Spaces), and they likely have different size limits anyway.

@harshavardhana
Member

No, not at all. I am not talking about that at all. If you look, I am not using low-level APIs.

What I am suggesting is that you redesign your application.

@harshavardhana
Member

Also notice that ComposeObject here does a server-side copy, so it is in fact quite beneficial and more efficient for your requirement.

A size of -1 for PutObject, coupled with not even having 576 MiB available, is an unusual requirement. Changing the client API for this would create more confusion for our users over an uncommon use case, so we are not going to do that.

So the above approach addresses your problem by letting you manage the buffer yourself, since you know the right buffer size for your environment, and it also gives you the flexibility of setting an upper bound on your total data transfer.

@rwestlund
Author

Fair enough, and thanks for the example code. But I ran into a problem with it:

Error in upload-part-copy - A header you provided implies functionality that is not implemented

It looks like Digital Ocean's S3 implementation doesn't support object composition from multiple objects as you have it implemented in ComposeObject(). It's just as well, as your example would corrupt data if there are multiple parallel uploads anyway.

I need to implement this as an actual multipart upload (where you get an upload ID from S3). This means I'd make a custom version of putObjectMultipartStreamNoLength(). However, the functions I need are not exported: newUploadID(), uploadPart(), abortMultipartUpload(), and completeMultipartUpload().

Do you have a suggestion for how to proceed?

@harshavardhana
Member

It looks like Digital Ocean's S3 implementation doesn't support object composition from multiple objects as you have it implemented in ComposeObject(). It's just as well, as your example would corrupt data if there are multiple parallel uploads anyway.

CopyObjectPart is a properly locked operation, so I don't see any storage implementation, Minio or AWS S3, making the replacement of a source object racy while it is being read and copied over on the other end @rwestlund. So this API is safe to use even in concurrent situations.

Since Digital Ocean doesn't implement it yet, you have two options:

  • Fork minio-go and use it locally with the upper bound based on 1 TiB instead of 5 TiB. You can change this in the optimalPartInfo() function in api-put-object-common.go.
  • Wait for Digital Ocean to implement all S3 API calls and use ComposeObject.

@rwestlund
Author

The problem I see in your example is with uploading each chunk as a normal object called "parts/1" and "parts/2", etc. If I started a second upload, it would overwrite "parts/1" while the first upload was still uploading "parts/2". When the first upload finally starts the server-side copy, it could get parts of the first file and parts of the second file mixed together.

I could generate a UUID for each upload to avoid object name collisions, but this is still a hacky way to upload a large file. S3's multipart API is there for a reason.

I'll check whether the AWS SDK or any of the smaller S3 libraries allow multipart uploads with bounded memory, then fork minio-go as a last resort.

@harshavardhana
Member

That is just an example; you can indeed use a UUID. It is not a hacky approach; it is pretty neat functionality used by some backup software solutions.
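
For illustration, a per-upload prefix could look something like this (a minimal sketch; github.com/google/uuid is just one choice of unique-ID package, not something required by minio-go):

package main

import (
        "fmt"

        "github.com/google/uuid"
)

func main() {
        // A fresh prefix per upload keeps concurrent uploads from overwriting
        // each other's temporary "parts/N" objects.
        prefix := "parts/" + uuid.New().String()
        for i := 0; i < 3; i++ {
                fmt.Printf("%s/%d\n", prefix, i) // e.g. parts/<uuid>/0
        }
}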

You can also look at the minio-go Core API to get direct access to the multipart API if needed.

Feel free to close this issue; please reopen if you have further questions.

@rwestlund
Author

Ah, the Core API is what I needed. Now I have multipart uploads with bounded memory :)

Thanks for your help, I really appreciate it.

@WGH-

WGH- commented Aug 10, 2023

I think we don't have to allocate partSize bytes (which is an enormous half a gigabyte by default) right away.

We can start with some smaller amount, like 512 bytes, and increase it as we go, just like io.ReadAll does.
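
A minimal sketch of that idea, under the assumption that the goal is simply to fill one part-sized buffer from an io.Reader without allocating the full part size up front (this is not minio-go's actual code, just the growth pattern io.ReadAll uses):

package main

import (
        "io"
        "log"
        "os"
)

// readPart reads up to partSize bytes from r. Instead of allocating partSize
// bytes immediately, it starts small and doubles the buffer as data arrives,
// the way io.ReadAll grows its slice.
func readPart(r io.Reader, partSize int64) ([]byte, error) {
        initial := int64(512)
        if partSize < initial {
                initial = partSize
        }
        buf := make([]byte, 0, initial)
        for int64(len(buf)) < partSize {
                if len(buf) == cap(buf) {
                        // Double the capacity, capped at the part size.
                        newCap := int64(cap(buf)) * 2
                        if newCap > partSize {
                                newCap = partSize
                        }
                        grown := make([]byte, len(buf), newCap)
                        copy(grown, buf)
                        buf = grown
                }
                n, err := r.Read(buf[len(buf):cap(buf)])
                buf = buf[:len(buf)+n]
                if err == io.EOF {
                        return buf, nil // stream ended before the part filled up
                }
                if err != nil {
                        return buf, err
                }
        }
        return buf, nil
}

func main() {
        // Example: read a single 10 MiB part from stdin.
        part, err := readPart(os.Stdin, 10*1024*1024)
        if err != nil {
                log.Fatalln(err)
        }
        log.Printf("read %d bytes into a buffer with capacity %d", len(part), cap(part))
}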
