
api-put-object-streaming: Removed unnecessary partBuf copy to save allocations; no race detected. #1673

Merged: 4 commits into minio:master on Jul 6, 2022

Conversation

@bwplotka (Contributor) commented Jul 5, 2022

Hi and thanks for the great work so far. 👋🏽

Following our discussion in minio/mc#3376 (comment) I performed extended testing, benchmarking and profiling to figure out what's best for minio-go and the Thanos issue.

Proving tests:

No matter how hard I tried, I could not reproduce ANY race the fix was made for. I even tried to trigger retries, and everything works fine for Linux *os.File. I looked for any repro tests on your side, but none were provided to prove that there is indeed a race. Perhaps the race was detected on a certain OS (Windows/macOS)?

The benchmarks show significant allocation problems while copying large files (PartSize 64MB). I also optimized the naive chunk preallocations.

benchstat v1.txt v2-upgr.txt
name       old time/op    new time/op    delta
Upload-12     300ms ± 7%     283ms ± 7%     ~     (p=0.310 n=5+5)

name       old alloc/op   new alloc/op   delta
Upload-12    69.3MB ± 0%     0.5MB ± 3%  -99.23%  (p=0.008 n=5+5)

name       old allocs/op  new allocs/op  delta
Upload-12     3.33k ± 2%     3.16k ± 1%   -5.01%  (p=0.008 n=5+5)

I know you reduced the default PartSize to 16MB (from 128MB) to reduce the default allocation spread, but this is still unacceptable, and a 16MB part size has other tradeoffs (latency). So in this PR I propose to revert that fix and optimize allocations. I would love to be able to use minio-go in Thanos and related projects, but we need this path to be leaner. One of the components, the Thanos sidecar, is usually deployed with limited memory (100-200 MB), so optimizations in this library matter.

Perhaps, if you don't want to accept this PR because you are afraid of (IMO unproven) potential races, you would be happy with adding an extra option to the Put API, like ReuseReader?

I am happy to be proven wrong and shown that we indeed have a race here (e.g. I did not test any OS other than Linux), but we can't accept a 60x allocation regression for a nonexistent race. Let me know what you think, thanks! 🤗

This version is also proposed to be used in Thanos: thanos-io/thanos#5474

Fixes: thanos-io/thanos#3967
Fixes: thanos-io/thanos#5393

@harshavardhana (Member):

No matter how hard I tried, I could not reproduce ANY race the fix was made for. I even tried to trigger retries, and everything works fine for Linux *os.File. I looked for any repro tests on your side, but none were provided to prove that there is indeed a race. Perhaps the race was detected on a certain OS (Windows/macOS)?

The race is basically a single file being Seeked at different locations in parallel; this is not supported by *os.File. What you have done in this PR is mainly to avoid such parallel Seeks and instead serialize them, because you are not using the buffered channel anymore.
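
To make that concrete, a minimal standalone sketch (illustrative only, not minio-go code) of the kind of parallel Seek being described: both goroutines below mutate the one offset shared by the *os.File, whereas ReadAt would take an explicit offset and leave the shared one alone.

package main

import (
	"fmt"
	"io"
	"os"
	"sync"
)

func main() {
	f, err := os.CreateTemp("", "seek-race-*")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())
	defer f.Close()
	if _, err := f.Write(make([]byte, 1<<20)); err != nil {
		panic(err)
	}

	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		off := int64(i) * (512 << 10)
		wg.Add(1)
		go func(off int64) {
			defer wg.Done()
			buf := make([]byte, 512<<10)
			// Racy: Seek moves the single offset shared by both goroutines, so
			// the following Read may start from wherever the other goroutine left it.
			if _, err := f.Seek(off, io.SeekStart); err != nil {
				panic(err)
			}
			_, _ = f.Read(buf)
			// Safe alternative: f.ReadAt(buf, off) takes an explicit offset and
			// never touches the shared one.
		}(off)
	}
	wg.Wait()
	fmt.Println("done")
}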

return UploadInfo{}, uploadRes.Error
}

// Update the totalUploadedSize.
totalUploadedSize += uploadRes.Size
// Store the parts to be completed in order.
Review comment (Member):

Keep the comments @bwplotka

// Store the parts to be completed in upload order.

Contributor Author (@bwplotka):

Yeah, this comment was quite confusing: we don't store things in order, we only sort them later. Happy to make that clear in the comment. 🤗

@harshavardhana (Member):

The reproducer is very simple:

package main

import (
	"context"
	"fmt"
	"os"
	"strconv"
	"sync"

	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

var (
	endpoint        = "localhost:9001"
	accessKeyID     = "some_key"
	secretAccessKey = "some_secret"
	useSSL          = false
	bName           = "test"
	bLocation       = ""
	toUpload        = "1GB_file.img"
	workers         = 5
)

// initializes MinIO client
func createMC() (*minio.Client, error) {
	return minio.New(endpoint, &minio.Options{
		Creds:  credentials.NewStaticV4(accessKeyID, secretAccessKey, ""),
		Secure: useSSL,
	})
}

// creates new bucket
func createBucket(mc *minio.Client) error {
	ctx := context.Background()
	exists, err := mc.BucketExists(ctx, bName)
	if err != nil {
		return err
	}
	if exists {
		return nil
	}
	if err := mc.MakeBucket(ctx, bName, minio.MakeBucketOptions{Region: bLocation}); err != nil {
		return err
	}
	return nil
}

func chanPopulate(ch chan<- string) {
	defer close(ch)
	for i := 0; i < 20; i++ {
		fmt.Printf("populating ch ID %d\n", i)
		ch <- toUpload
	}
}

func main() {
	mc, err := createMC()
	if err != nil {
		fmt.Printf("failed to create MinIO client, %v\n", err)
		os.Exit(1)
	}
	if err := createBucket(mc); err != nil {
		fmt.Printf("failed to create bucket, %v\n", err)
		os.Exit(1)
	}

	files := make(chan string, 5)
	go chanPopulate(files)

	var wg sync.WaitGroup
	for i := 1; i < 6; i++ {
		fmt.Printf("worker is starting ID %d\n", i)
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			fmt.Printf("worker started ID %d\n", i)
			for p := range files {
				fInfo, err := os.Stat(p)
				if err != nil {
					fmt.Printf("failed to stat file, path %s, %v\n", p, err)
					return
				}
				f, err := os.Open(p)
				if err != nil {
					fmt.Printf("failed to open file, path %s, %v\n", p, err)
					return
				}
				info, err := mc.PutObject(
					context.Background(),
					bName,
					toUpload+strconv.Itoa(i),
					f,
					fInfo.Size(),
					minio.PutObjectOptions{},
				)
				f.Close()
				if err != nil {
					fmt.Printf("failed to upload file, path %s, %v\n", p, err)
					return
				}
				fmt.Printf("file uploaded, size '%d'\n", info.Size)
			}
		}(i)
	}
	wg.Wait()
}

This change in the PR is basically not doing anything other than making the entire upload operation serial, @bwplotka. What we essentially wanted was to be able to upload x parts in parallel, which we cannot do in your PR, because:

  • uploadPartsRes is unbuffered, so it can only ever upload a part if the result side reads the result.
  • the multiple parallel goroutines are not doing anything anymore, since they are all blocked on uploadPartsRes.

Of course, this is why you won't see the race at all.

Since the "length" of the stream must be known in S3 API there is no way to "stream", and we can't read random offsets and allow them to be "safely" accessed without serializing on the single file.

From what I can see, parallel uploads of multiple parts within an object never really worked properly without buffering, because of S3's length requirement; simply using io.NewSectionReader() doesn't preserve that.
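
For readers following along, a minimal sketch (illustrative only, not minio-go code; the file name is the hypothetical input from the repro above) of how fixed-length part views are typically carved out of a file with io.NewSectionReader; each section's Size() is the part length the S3 API needs up front:

package main

import (
	"fmt"
	"io"
	"os"
)

// partReaders splits a file into fixed-size parts, each exposed as an
// io.SectionReader with an explicit, known length (the last part may be shorter).
func partReaders(f *os.File, partSize int64) ([]*io.SectionReader, error) {
	fi, err := f.Stat()
	if err != nil {
		return nil, err
	}
	var parts []*io.SectionReader
	for off := int64(0); off < fi.Size(); off += partSize {
		n := partSize
		if rem := fi.Size() - off; rem < n {
			n = rem
		}
		parts = append(parts, io.NewSectionReader(f, off, n))
	}
	return parts, nil
}

func main() {
	f, err := os.Open("1GB_file.img") // hypothetical input, as in the repro
	if err != nil {
		panic(err)
	}
	defer f.Close()

	parts, err := partReaders(f, 16<<20) // 16MB parts, the current default
	if err != nil {
		panic(err)
	}
	for i, p := range parts {
		fmt.Printf("part %d: %d bytes\n", i+1, p.Size()) // length known before the upload starts
	}
}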

@harshavardhana (Member):

Let me think about this more.

@bwplotka (Contributor Author) commented Jul 6, 2022

Thanks for even considering my PR; let's unpack this. 🤗

uploadPartsRes is unbuffered, so it can only ever upload a part if the result side reads the result.

That's incorrect; they are not blocked, because the main goroutine actively consumes the results into our slice (see the result-collecting loop in the PR).

The benchmarks prove it (no regression in latency for this PR).
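
To illustrate the claim, a small standalone sketch (not the PR code) of sending on an unbuffered channel while the collector reads concurrently; the senders only block for the brief hand-off, so the simulated uploads still overlap:

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	const workers = 4
	results := make(chan int) // unbuffered, like uploadPartsRes in the PR

	var wg sync.WaitGroup
	for i := 1; i <= workers; i++ {
		wg.Add(1)
		go func(part int) {
			defer wg.Done()
			time.Sleep(200 * time.Millisecond) // stand-in for uploading one part
			results <- part
		}(i)
	}
	go func() {
		wg.Wait()
		close(results)
	}()

	start := time.Now()
	for part := range results {
		fmt.Println("collected part", part)
	}
	// Elapsed is roughly 200ms, not workers*200ms: the unbuffered channel did
	// not serialize the simulated uploads, because the collector reads concurrently.
	fmt.Println("elapsed:", time.Since(start).Round(10*time.Millisecond))
}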

Of course, this is why you won't see the race at all.

That's not the problem; there is concurrency here. I also tested with the race detector on the exact commit before your fix commit. There must be something I'm doing differently (or wrong) if there is a race somewhere.

The race is basically a single file being Seeked at different locations in parallel; this is not supported by *os.File. What you have done in this PR is mainly to avoid such parallel Seeks and instead serialize them, because you are not using the buffered channel anymore.

Again, non-buffering does not change anything here. Even with buffering, I could not reproduce the race you mentioned. Sorry for adding that extra complexity to the discussion; it's a red herring here.

The true source of this problem is indeed Seek thread-safety. Go on Linux uses the lseek syscall for this, which is prone to races if you depend on the "current offset". But the truth is, we never use the current offset in the current implementation, thanks to ReaderAt and NewSectionReader, which only use ReadAt. You can confirm this with the following test:

package file

import (
	"bytes"
	"crypto/rand"
	"io"
	"os"
	"path/filepath"
	"sync"
	"testing"

	"github.com/efficientgo/tools/core/pkg/errcapture"
	"github.com/efficientgo/tools/core/pkg/testutil"
	"github.com/pkg/errors"
)

func createTestInput(fn string, sizeBytes int) (err error) {
	if err := os.MkdirAll(filepath.Dir(fn), os.ModePerm); err != nil {
		return err
	}

	f, err := os.OpenFile(fn, os.O_CREATE|os.O_WRONLY, os.ModePerm)
	if err != nil {
		return errors.Wrap(err, "open")
	}

	defer func() {
		if err != nil {
			errcapture.Do(&err, func() error { return os.Remove(fn) }, "remove failed file")
		}
	}()

	b := make([]byte, 10*1024*1024)
	for i := 0; i < sizeBytes/len(b); i++ {

		if _, err := rand.Read(b); err != nil {
			return errors.Wrap(err, "read urandom")
		}
		if _, err := f.Write(b); err != nil {
			return err
		}
	}

	return f.Close()
}

const partSize int64 = 1 * 1024 * 1024

func readThreeSections(t *testing.T, f *os.File) ([]byte, []byte, []byte) {
	s1 := io.NewSectionReader(f, 20, partSize)
	s2 := io.NewSectionReader(f, 2*partSize, partSize)
	s3 := io.NewSectionReader(f, 10*partSize, partSize)

	s1read := make([]byte, partSize)
	_, err := s1.Read(s1read)
	testutil.Ok(t, err)

	s3read := make([]byte, partSize)
	_, err = s3.Read(s3read)
	testutil.Ok(t, err)

	s2read := make([]byte, partSize)
	_, err = s2.Read(s2read)
	testutil.Ok(t, err)

	return s1read, s2read, s3read
}

func TestFile_SectionReader(t *testing.T) {
	fn := filepath.Join(t.TempDir(), "test.file")
	testutil.Ok(t, createTestInput(fn, 100*1024*1024))

	f, err := os.Open(fn)
	testutil.Ok(t, err)

	s1read, s2read, s3read := readThreeSections(t, f)

	testutil.Assert(t, bytes.Compare(s1read, s2read) != 0)
	testutil.Assert(t, bytes.Compare(s2read, s3read) != 0)
	testutil.Assert(t, bytes.Compare(s1read, s3read) != 0)

	defer func() { testutil.Ok(t, f.Close()) }()

	// We can reuse the same file descriptor for new section reads; they do not touch the "current offset".

	wg := sync.WaitGroup{}
	wg.Add(100)
	for i := 0; i < 100; i++ {
		go func() {
			defer wg.Done()

			s1, s2, s3 := readThreeSections(t, f)
			testutil.Equals(t, s1read, s1)
			testutil.Equals(t, s2read, s2)
			testutil.Equals(t, s3read, s3)
		}()
	}
	wg.Wait()
}

Thanks for providing the repro you used back then, but does it actually trigger a race warning on my PR? I will test it later, but my test is similar and it was fine. Interestingly enough, in my test I use exactly the same file descriptor, which should trigger even more races if our code is racy. It proves to be fine, though.

@klauspost (Contributor):

@bwplotka

ReadAt is concurrency safe on *os.File, but the io.ReaderAt interface does not mandate that this is safe for all implementations.

Therefore, assuming that it is can lead to subtle bugs.
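
For illustration, a standalone sketch (not from minio-go) of an io.ReaderAt whose ReadAt is implemented with Seek+Read on one shared stream; it satisfies the interface's method set, but calling it from multiple goroutines would race on the underlying offset:

package main

import (
	"fmt"
	"io"
	"strings"
)

// seekingReaderAt implements io.ReaderAt in terms of Seek+Read on a single
// shared stream. Its ReadAt is NOT safe to call from multiple goroutines:
// another goroutine's Seek can land between the Seek and the Read below.
type seekingReaderAt struct {
	rs io.ReadSeeker // shared, unsynchronized
}

func (s *seekingReaderAt) ReadAt(p []byte, off int64) (int, error) {
	if _, err := s.rs.Seek(off, io.SeekStart); err != nil {
		return 0, err
	}
	return io.ReadFull(s.rs, p)
}

func main() {
	var ra io.ReaderAt = &seekingReaderAt{rs: strings.NewReader("hello, world")}
	buf := make([]byte, 5)
	n, err := ra.ReadAt(buf, 7)
	fmt.Println(n, err, string(buf[:n])) // 5 <nil> world
}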

@bwplotka (Contributor Author) commented Jul 6, 2022

Ack.

Would you accept an explicit option to reuse the reader in the options, then? (:

@harshavardhana (Member) commented Jul 6, 2022

Ack.

Would you accept an explicit option to reuse the reader in the options, then? (:

What I would suggest: instead of a generic ReaderAt, make that ReaderAt an explicit *os.File. The rest pay the penalty of being treated as io.Reader instead, since we can't reliably guarantee that concurrency is possible.
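
A sketch of the dispatch being suggested, with hypothetical helper names (the real minio-go functions take more parameters such as ctx, bucket, object, and options): only an *os.File is routed to the parallel ReadAt path, everything else is treated as a plain io.Reader.

package main

import (
	"fmt"
	"io"
	"os"
	"strings"
)

// Hypothetical stand-ins for the two upload paths discussed above.
func uploadViaReadAt(f *os.File, size int64) error {
	fmt.Println("parallel ReadAt path, size", size)
	return nil
}

func uploadViaReader(r io.Reader, size int64) error {
	fmt.Println("sequential io.Reader path, size", size)
	return nil
}

// putObject dispatches on the concrete type: only *os.File, whose ReadAt is
// safe for concurrent use, gets the parallel path; everything else pays the
// penalty of being a plain io.Reader.
func putObject(r io.Reader, size int64) error {
	if f, ok := r.(*os.File); ok {
		return uploadViaReadAt(f, size)
	}
	return uploadViaReader(r, size)
}

func main() {
	_ = putObject(strings.NewReader("not a file"), 10)
}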

@harshavardhana (Member) left a review comment:

func (c *Client) putObjectMultipartStreamFromReadAt(ctx context.Context, bucketName, objectName string,
	reader *os.File, size int64, opts PutObjectOptions,

Just this change would provide the necessary guarantees.

@harshavardhana (Member) commented Jul 6, 2022

The actual issue is because of a "retry"; I reproduced the problem even with this PR:

==================
WARNING: DATA RACE
Write at 0x00c0003022b8 by goroutine 181:
  io.(*SectionReader).Seek()
      /home/harsha/sdk/go1.18.3/src/io/io.go:534 +0x165
  github.com/minio/minio-go/v7.(*Client).executeMethod()
      /home/harsha/go/src/github.com/minio/minio-go/api.go:580 +0x542
  github.com/minio/minio-go/v7.(*Client).uploadPart()
      /home/harsha/go/src/github.com/minio/minio-go/api-put-object-multipart.go:298 +0x4bb
  github.com/minio/minio-go/v7.(*Client).putObjectMultipartStreamFromReadAt.func3()
      /home/harsha/go/src/github.com/minio/minio-go/api-put-object-streaming.go:187 +0x695
  github.com/minio/minio-go/v7.(*Client).putObjectMultipartStreamFromReadAt.func4()
      /home/harsha/go/src/github.com/minio/minio-go/api-put-object-streaming.go:211 +0x47

Previous write at 0x00c0003022b8 by goroutine 150:
  ??()
      -:0 +0xffffffffffffffff
  time.Time.AppendFormat()
      /home/harsha/sdk/go1.18.3/src/time/format.go:617 +0x1d5
  time.Time.Format()
      /home/harsha/sdk/go1.18.3/src/time/format.go:597 +0xe4
  github.com/minio/minio-go/v7/pkg/signer.buildChunkStringToSign()
      /home/harsha/go/src/github.com/minio/minio-go/pkg/signer/request-signature-streaming.go:84 +0x92
  github.com/minio/minio-go/v7/pkg/signer.buildChunkSignature()
      /home/harsha/go/src/github.com/minio/minio-go/pkg/signer/request-signature-streaming.go:119 +0xd1
  github.com/minio/minio-go/v7/pkg/signer.(*StreamingReader).signChunk()
      /home/harsha/go/src/github.com/minio/minio-go/pkg/signer/request-signature-streaming.go:164 +0x1b0
  github.com/minio/minio-go/v7/pkg/signer.(*StreamingReader).Read()
      /home/harsha/go/src/github.com/minio/minio-go/pkg/signer/request-signature-streaming.go:275 +0x778
  io.(*LimitedReader).Read()
      /home/harsha/sdk/go1.18.3/src/io/io.go:476 +0xc5
  io.copyBuffer()
      /home/harsha/sdk/go1.18.3/src/io/io.go:426 +0x28a
  io.Copy()
      /home/harsha/sdk/go1.18.3/src/io/io.go:385 +0x8d
  net.genericReadFrom()
      /home/harsha/sdk/go1.18.3/src/net/net.go:662 +0x28
  net.(*TCPConn).readFrom()
      /home/harsha/sdk/go1.18.3/src/net/tcpsock_posix.go:54 +0xcc
  net.(*TCPConn).ReadFrom()
      /home/harsha/sdk/go1.18.3/src/net/tcpsock.go:130 +0x68
  io.copyBuffer()
      /home/harsha/sdk/go1.18.3/src/io/io.go:412 +0x1c2
  io.Copy()
      /home/harsha/sdk/go1.18.3/src/io/io.go:385 +0x86
  net/http.persistConnWriter.ReadFrom()
      /home/harsha/sdk/go1.18.3/src/net/http/transport.go:1775 +0x18
  bufio.(*Writer).ReadFrom()
      /home/harsha/sdk/go1.18.3/src/bufio/bufio.go:768 +0x28d
  io.copyBuffer()
      /home/harsha/sdk/go1.18.3/src/io/io.go:412 +0x1c2
  io.Copy()
      /home/harsha/sdk/go1.18.3/src/io/io.go:385 +0x64
  net/http.(*transferWriter).doBodyCopy()
      /home/harsha/sdk/go1.18.3/src/net/http/transfer.go:411 +0x3e
  net/http.(*transferWriter).writeBody()
      /home/harsha/sdk/go1.18.3/src/net/http/transfer.go:369 +0x799
  net/http.(*Request).write()
      /home/harsha/sdk/go1.18.3/src/net/http/request.go:698 +0x1197
  net/http.(*persistConn).writeLoop()
      /home/harsha/sdk/go1.18.3/src/net/http/transport.go:2395 +0x2e4
  net/http.(*Transport).dialConn.func6()
      /home/harsha/sdk/go1.18.3/src/net/http/transport.go:1751 +0x39

Goroutine 181 (running) created at:
  github.com/minio/minio-go/v7.(*Client).putObjectMultipartStreamFromReadAt()
      /home/harsha/go/src/github.com/minio/minio-go/api-put-object-streaming.go:158 +0x73e
  github.com/minio/minio-go/v7.(*Client).putObjectMultipartStream()
      /home/harsha/go/src/github.com/minio/minio-go/api-put-object-streaming.go:49 +0x2ba
  github.com/minio/minio-go/v7.(*Client).putObjectCommon()
      /home/harsha/go/src/github.com/minio/minio-go/api-put-object.go:278 +0x8b7
  github.com/minio/minio-go/v7.(*Client).PutObject()
      /home/harsha/go/src/github.com/minio/minio-go/api-put-object.go:244 +0x324
  main.main.func1()
      /home/harsha/go/src/github.com/minio/minio-go/race-test.go:95 +0x433
  main.main.func3()
      /home/harsha/go/src/github.com/minio/minio-go/race-test.go:111 +0x47
==================

You are not able to reproduce this because you are not seeing the issue with retrying, which is triggered when one of the parallel uploads fails. Since there is a Seek() call to re-upload, that is inherently racy, @bwplotka.

To avoid that Seek, we have to make sure that Seeking is not allowed if you are going to use ReaderAt without buffering.

Instead, the Seek should be converted into a ReaderAt call again to avoid a shared offset in parallel, or we should add a locked "SectionReader".

I reproduced it by taking the server up and down multiple times; this triggers the race.
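
For reference, one shape the "locked SectionReader" mentioned above could take; this is a sketch under that assumption, not the fix that was eventually merged:

package main

import (
	"fmt"
	"io"
	"strings"
	"sync"
)

// lockedSectionReader serializes Seek and Read on a section with a mutex, so a
// retry's Seek can no longer race with an in-flight Read of the same section.
// ReadAt does not touch the section's internal offset and needs no lock.
type lockedSectionReader struct {
	mu sync.Mutex
	sr *io.SectionReader
}

func newLockedSectionReader(r io.ReaderAt, off, n int64) *lockedSectionReader {
	return &lockedSectionReader{sr: io.NewSectionReader(r, off, n)}
}

func (l *lockedSectionReader) Read(p []byte) (int, error) {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.sr.Read(p)
}

func (l *lockedSectionReader) Seek(offset int64, whence int) (int64, error) {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.sr.Seek(offset, whence)
}

func (l *lockedSectionReader) ReadAt(p []byte, off int64) (int, error) {
	return l.sr.ReadAt(p, off)
}

func main() {
	sec := newLockedSectionReader(strings.NewReader("0123456789abcdef"), 4, 8)
	buf := make([]byte, 8)
	n, _ := io.ReadFull(sec, buf)
	fmt.Println(string(buf[:n])) // 456789ab
}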

@bwplotka (Contributor Author) commented Jul 6, 2022

Ack. I manually injected a fault into "executeMethod" to try to trigger the race, but unsuccessfully.

Thanks for the repro; it seems we know what we are looking for now!

@harshavardhana (Member):

Let me fix it in your PR and send a PR to your fork - we can then merge it.

@harshavardhana (Member) left a review comment:

LGTM and tested

@harshavardhana harshavardhana merged commit 618e300 into minio:master Jul 6, 2022
@bwplotka (Contributor Author) commented Jul 7, 2022

Amazing 😱

You are all the best! <3

@bwplotka (Contributor Author) commented Jul 7, 2022

FYI: PR to remove the fork: thanos-io/thanos#5474

Successfully merging this pull request may close these issues: "Help minio-go with over-allocations on Multi-Part Upload", "Use original Minio upstream for Minio client".