
S3 client falling over after 4-5 hours of processing 200 files per minute #2657

Open · JamieClayton7 opened this issue May 28, 2024 · 7 comments

Labels: bug (This issue is a bug.), p2 (This is a standard priority issue), service-api (This issue is due to a problem in a service API, not the SDK implementation.)

@JamieClayton7 commented May 28, 2024

Describe the bug

My application polls SQS for messages generated by S3 event notifications on ObjectCreated. The object is pulled with GetObject; we process the ndjson file and re-upload it (PutObject) to the bucket with the updated contents. This consistently works fine for 4-5 hours on average, until the client seems to fall over and every request fails with the following error message:

api error RequestTimeout: Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed.

If I restart the service, we are back to working correctly. The same client is being used across a maximum of 50 goroutines. The service is running in AWS ECS Fargate with Go version 1.22.3.

Edit: For context - each file is ~2MB.
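
For illustration, a minimal sketch of this flow (handleObject and processNDJSON are placeholder names, not our actual code):

package pipeline

import (
	"bytes"
	"context"
	"io"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

// processNDJSON stands in for the application's transformation step.
func processNDJSON(r io.Reader) ([]byte, error) {
	return io.ReadAll(r)
}

// handleObject mirrors the per-message work described above: download the
// object referenced by the S3 event notification, rewrite it, upload it back.
func handleObject(ctx context.Context, client *s3.Client, bucket, key string) error {
	obj, err := client.GetObject(ctx, &s3.GetObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
	})
	if err != nil {
		return err
	}
	defer obj.Body.Close()

	// Rewrite the ~2MB ndjson payload.
	updated, err := processNDJSON(obj.Body)
	if err != nil {
		return err
	}

	// Upload the updated contents back to the same key.
	_, err = client.PutObject(ctx, &s3.PutObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
		Body:   bytes.NewReader(updated),
	})
	return err
}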

Expected Behavior

The S3 client should consistently complete PutObject calls without error.

Current Behavior

After a period of time (4-5 hours on average), the client begins to throw the same error for every PutObject operation:

api error RequestTimeout: Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed.

Reproduction Steps

import (
	"context"
	"errors"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

const (
	OBJECT_CREATED_STATUS = "status"
	UPDATED               = "updated"
)

type S3Updater[T interface{}] struct {
	*s3.Client
	FileConverter[T]
	Bucket *string
	Key    *string
}

func NewS3Updater[T interface{}](client *s3.Client, fileConverter FileConverter[T]) *S3Updater[T] {
	return &S3Updater[T]{
		Client:        client,
		FileConverter: fileConverter,
	}
}

func (updater S3Updater[T]) Update(ctx context.Context, val *[]T, opts ...func(Updater[T])) error {
	// Apply functional options (e.g. bucket/key) before validating.
	for _, opt := range opts {
		opt(&updater)
	}
	if updater.Bucket == nil || updater.Key == nil {
		return errors.New("invalid s3 parameters provided")
	}
	// Convert the slice into an ndjson stream.
	file, err := updater.FileConverter(val)
	if err != nil {
		return err
	}
	ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()
	_, err = updater.PutObject(ctx, &s3.PutObjectInput{
		Bucket: updater.Bucket,
		Key:    updater.Key,
		Body:   aws.ReadSeekCloser(file),
		Metadata: map[string]string{
			OBJECT_CREATED_STATUS: UPDATED,
		},
	})
	if err != nil {
		return err
	}
	return nil
}
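
For completeness, a hedged sketch of the supporting definitions the snippet above assumes but does not show; the names here are guesses (they are not in the original report), and FileConverter's signature is inferred from the ConvertNdJson implementation posted later in this thread:

// Assumed: FileConverter turns a slice into an ndjson stream.
type FileConverter[T interface{}] func(*[]T) (io.Reader, error)

// Assumed: the interface that the functional options operate on.
type Updater[T interface{}] interface {
	Update(ctx context.Context, val *[]T, opts ...func(Updater[T])) error
}

// Hypothetical option that sets the target bucket and key on an S3Updater.
func WithLocation[T interface{}](bucket, key string) func(Updater[T]) {
	return func(u Updater[T]) {
		if s3u, ok := u.(*S3Updater[T]); ok {
			s3u.Bucket = &bucket
			s3u.Key = &key
		}
	}
}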

Possible Solution

No response

Additional Information/Context

No response

AWS Go SDK V2 Module Versions Used

github.com/aws/aws-sdk-go-v2/service/s3 v1.54.2
github.com/aws/aws-sdk-go-v2 v1.27.0

Compiler and Version used

1.22.3

Operating System and version

Alpine 3.20 https://hub.docker.com/layers/library/alpine/3.20/images/sha256-216266c86fc4dcef5619930bd394245824c2af52fd21ba7c6fa0e618657d4c3b

@JamieClayton7 added the bug and needs-triage (This issue or PR still needs to be triaged.) labels on May 28, 2024
@RanVaknin (Contributor) commented:

Hi @JamieClayton7 ,

From reading about it in similar issues on GitHub:

aws/aws-sdk-go#411
aws/aws-sdk-js#281
aws/aws-sdk-php#29
aws/aws-cli#401
aws/aws-sdk-ruby#241

It seems like this error is thrown when the Content-Length sent with the failing request does not match the actual length of the request body.

I'm not sure what fileConverter does here, and it's not clear what data type file is. I don't see its definition or the relevant import, but my guess is that it's the culprit. I suggest you verify the integrity and consistency of the file before and after the conversion process; logging the size and content of the file at different stages might help identify discrepancies. Lastly, I advise you to enable the Go SDK logger to inspect the raw requests sent (try to isolate the logs for the failing requests).

You can enable the SDK logger like so:

	cfg, err := config.LoadDefaultConfig(
		context.TODO(),
		config.WithRegion("us-east-1"), // or any other region
		config.WithClientLogMode(aws.LogRequestWithBody|aws.LogResponseWithBody),
	)
	if err != nil {
		log.Fatal(err)
	}
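
For example, you could buffer the converted body and log its exact length right before PutObject (a rough sketch reusing the updater and file variables from your snippet; a bytes.Reader also gives the SDK a seekable body with a known length):

// Illustration: buffer the converted body so its exact length can be logged
// and compared against what the service sees.
data, err := io.ReadAll(file)
if err != nil {
	return err
}
log.Printf("uploading %s/%s: body is %d bytes", *updater.Bucket, *updater.Key, len(data))

_, err = updater.PutObject(ctx, &s3.PutObjectInput{
	Bucket: updater.Bucket,
	Key:    updater.Key,
	Body:   bytes.NewReader(data),
	Metadata: map[string]string{
		OBJECT_CREATED_STATUS: UPDATED,
	},
})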

Thanks,
Ran~

@RanVaknin self-assigned this on May 28, 2024
@RanVaknin added the response-requested (Waiting on additional info and feedback. Will move to "closing-soon" in 7 days.) and p2 labels and removed the needs-triage label on May 28, 2024
@JamieClayton7 (Author) commented May 29, 2024

Thanks @RanVaknin

As the aws.ReadSeekCloser function takes an io.Reader as its argument, it should be clear what the type of the body is. I have changed the client options to RetryMaxAttempts = 10 and haven't seen the error since, which suggests the would-be failing PutObject calls are now being retried beyond the SDK's default maximum retry limit.
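
For reference, the retry limit can be raised roughly like this (a sketch using the standard retryer from github.com/aws/aws-sdk-go-v2/aws/retry, not necessarily my exact configuration):

cfg, err := config.LoadDefaultConfig(context.TODO(),
	config.WithRetryer(func() aws.Retryer {
		// Standard retryer with max attempts raised from the default (3) to 10.
		return retry.AddWithMaxAttempts(retry.NewStandard(), 10)
	}),
)
if err != nil {
	log.Fatal(err)
}
client := s3.NewFromConfig(cfg)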

If the first X files we upload work correctly, and the remaining uploads succeed after brute-force retries, it suggests the issue is server-side. Just to double-check that I'm not missing something, the FileConverter implementation is as follows:

func ConvertNdJson[T interface{}](val *[]T) (io.Reader, error) {
	b := make([]byte, 0)
	w := bytes.NewBuffer(b)
	enc := json.NewEncoder(w)
	for _, item := range *val {
		if err := enc.Encode(item); err != nil {
			return nil, err
		}
	}
	return w, nil
}

The github-actions bot removed the response-requested label on May 30, 2024
@lucix-aws added the needs-reproduction (This issue needs reproduction.) label on Jun 3, 2024
@RanVaknin (Contributor) commented:

Hi @JamieClayton7 ,

Thanks for the additional info. Based on what you mentioned with regard to retries, it does sound like an intermittent service-related issue.

For this to be investigated by the S3 service team, I'll need the request ID of one of the failing requests. You can see the request ID if you enable response logging.

Thanks,
Ran~

@RanVaknin added the response-requested label and removed the needs-reproduction label on Jun 3, 2024
@JamieClayton7 (Author) commented Jun 5, 2024

@RanVaknin

Here are various failing response bodies with codes:

<Error><Code>RequestTimeout</Code><Message>Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed.</Message><RequestId>KREFE2A3HA14QZFM</RequestId><HostId>oRYyCgf21gxAp3X0jnChw0qVhacR4qdwcjOI5cO7AsCSPn0o7DdhZ8Gufihyv5KeszGjY1qpU/s=</HostId></Error>
<Error><Code>RequestTimeout</Code><Message>Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed.</Message><RequestId>3BFJ0PA40R8DKHZA</RequestId><HostId>DYueP8YqPCDnwv9iwpdD5eNLStnRgydSoI1xjuVncirFzSQ3BSa/+6w3Gntg1YZJ9KqPvoKDhXVW2zhs7cpAvw==</HostId></Error>
<Error><Code>RequestTimeout</Code><Message>Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed.</Message><RequestId>RBP53G4DDET8ZNYJ</RequestId><HostId>TQjjp7AS0pO4rjo4fhPRF8zcaFri9WlLgYzfr72C8QMG3v12TTp0IZmrRQV0QoISc5jUwnD/CtA=</HostId></Error>

The github-actions bot removed the response-requested label on Jun 6, 2024
@JamieClayton7 (Author) commented:

@RanVaknin, is there any update on this? I've had to push a monkey patch to production that restarts the container on RequestTimeout to work around this problem.
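
(For anyone hitting the same thing: the error code can be detected with the SDK's typed errors, roughly like the sketch below; the restart itself is application-specific.)

// Sketch: check whether the PutObject error is the RequestTimeout API error.
// Requires github.com/aws/smithy-go and the standard errors package.
var apiErr smithy.APIError
if errors.As(err, &apiErr) && apiErr.ErrorCode() == "RequestTimeout" {
	// application-specific recovery, e.g. signal the container to exit so ECS restarts it
}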

@RanVaknin added the investigating (This issue is being investigated and/or work is in progress to resolve the issue.) and service-api labels on Jul 11, 2024
@RanVaknin (Contributor) commented:

Hi @JamieClayton7 ,

Thanks for your patience. I had to synthesize all of the similar issues from the other SDKs to make this ticket actionable by the S3 team. We have an internal ticket up with S3 (internal ticket #P148550293). If you have access to AWS Support via the developer console, you could open a ticket yourself and reference the one I created to provide more data directly to the S3 team.

I will do my best to keep you updated on the matter.

Thanks,
Ran~

@RanVaknin removed the investigating label on Aug 15, 2024
@tuananh commented Sep 12, 2024

I'm also facing this issue; the request ID is SD8632ECNY3JT5GP, if it's of any help.
